AD-A129 438

FAULT TOLERANCE RELIABILITY AND TESTABILITY FOR DISTRIBUTED SYSTEMS(U)
SOHAR INC LOS ANGELES CA
H HECHT ET AL. FEB 83 RADC-TR-83-36 F30602-81-C-0133

UNCLASSIFIED  F/G 17/2

UNCLASSIFIED 


REPORT DOCUMENTATION PAGE

4. TITLE (and Subtitle)
FAULT TOLERANCE, RELIABILITY AND TESTABILITY FOR DISTRIBUTED SYSTEMS

5. TYPE OF REPORT & PERIOD COVERED
Interim Report
Sep 81 - Oct 82

6. PERFORMING ORG. REPORT NUMBER
N/A

7. AUTHOR(s)
Herbert Hecht
Myron Hecht
With contributions by K. H. Kim

8. CONTRACT OR GRANT NUMBER(s)
F30602-81-C-0133

9. PERFORMING ORGANIZATION NAME AND ADDRESS
SoHaR Incorporated
1040 S. LaJolla Ave

10. PROGRAM ELEMENT, PROJECT, TASK AREA & WORK UNIT NUMBERS
62702F
23380245

11. CONTROLLING OFFICE NAME AND ADDRESS
Rome Air Development Center (RBET)
Griffiss AFB NY 13441

12. REPORT DATE
February 1983

13. NUMBER OF PAGES
84

14. MONITORING AGENCY NAME & ADDRESS (if different from Controlling Office)
Same

15. SECURITY CLASS. (of this report)
UNCLASSIFIED

16. DISTRIBUTION STATEMENT (of this Report)
Approved for public release; distribution unlimited.

17. DISTRIBUTION STATEMENT (of the abstract entered in Block 20, if different from Report)
Same

18. SUPPLEMENTARY NOTES
RADC Project Engineer: Heather Dussault (RBET)

19. KEY WORDS (Continue on reverse side if necessary and identify by block number)
Fault  Tolerance  Fault  Isolation
Distributed  Systems  Fault  Tolerant  Design
Reliability  Improvement  Networks
Testability  Effectiveness

20. ABSTRACT (Continue on reverse side if necessary and identify by block number)

A growing need exists for improved fault tolerance, reliability, and
testability in distributed systems which support Command, Control,
Communications, and Intelligence (C3I) activities. The objective of this
study is to provide a foundation for the development of design measures
and guidelines for the design of fault tolerant systems. Taxonomies of
fault tolerance and distributed systems are developed, and typical Air
Force C3I needs in both fault tolerant and distributed computer systems
are characterized. Reliability and availability experience for ten
typical computer systems is reported in a consistent format, and the
data are analyzed from the perspective of a distributed system user.
Previous work on the identification of problems in distributed systems
and design methods for their solutions is discussed. Key issues in the
design of fault tolerant distributed systems are identified. Fault
location techniques for specific computer configurations found in C3I
applications are described in detail. The study is a continuing effort
and a comprehensive design methodology will be developed based upon the
material presented in this report.

This report covers the first half of a study of fault tolerance, reliability,
and testability in distributed systems that support command, control,
communications, and intelligence (C3I) activities. The study is motivated by
the need for continuous availability of the computing function in C3I
applications and by the increasing utilization of distributed computing in
this field. The study is intended to provide a framework for the
characterization of fault tolerance provisions, their evaluation against the
needs of C3I activities, and recommendations for improvements in fault
tolerance, reliability, and testability where these are warranted.

The methodology utilized includes reviews of the general literature of fault
tolerant and distributed computing with particular emphasis on reports
generated by DoD agencies related to C3I activities; on-site reviews of the
reliability experience of selected DoD facilities; collection of pertinent
reliability data from non-DoD facilities where these can be obtained; and
original research in areas not adequately covered by prior investigations.

As part of the effort reported here, taxonomies of fault tolerance and of
distributed systems were developed (Sections 2.1 - 2.3), and functional needs
of C3I activities for fault tolerance have been characterized (Section 2.4).
The reliability and availability experience of ten typical computer systems
(including two Air Force applications) is reported in a consistent format, and
the data are interpreted from the point of view of a user of distributed
systems (Section 3). A framework for the investigation of design
methodologies for distributed systems is described (Section 4.1), and previous
work is summarized and key issues in design are identified (Sections 4.2 and
4.3). Fault location techniques applicable to specific computer
configurations found in C3I applications are described in detail (Section 5).

The study is continuing, and a comprehensive design methodology will be
developed based on the work reported here.


PREFACE


A growing need exists for improved fault tolerance, reliability,
and testability in distributed systems which support command, control,
communications, and intelligence (C3I) activities. This interim report
identifies those system functional needs, design motivations, and
key design issues, and presents a logic which can be used for comparative
analysis and evaluation of fault tolerant distributed system
reliability, testability, and effectiveness. The results presented
are based upon current operational experience and previous studies in
the areas of fault tolerant design and distributed computing. This
report is intended to provide a foundation for the development of
measures and guidelines for the design and evaluation of fault tolerance,
reliability, and testability in distributed systems.


SECTION  1  —  INTRODUCTION 


This is an interim report generated on RADC Contract F30602-81-C-0133,
Reliability/Testability/Design Considerations for Fault Tolerant Systems, by
SoHaR Incorporated. The study is particularly aimed at applications in the
command, control, communications, and intelligence (C3I) area. At the time of
the writing of this report, approximately one-half of the twenty-eight month
duration of the project had elapsed. The major goal of the present report is
to describe the fault tolerance, reliability, and testability found in present
C3I systems, or in systems that are technically similar to those in the C3I
field.

Distributed systems are coming into increasing use throughout the digital
processing field because of the flexibility, performance, and reliability
advantages which they offer. Examples of benefits in flexibility are (a) the
ability to route computing tasks to the most suitable processor (as contrasted
with the local processor that may not be very efficient for a given task), (b)
the ability to add processors incrementally as the computing load increases,
and (c) the ability to introduce technical advances gradually, one processor
at a time, while retaining existing computers on-line, thus avoiding the major
software and systems problems that arise when dedicated computers are replaced
by newer models. Performance (throughput of computing tasks) is enhanced
because distributed computing allows any temporarily idle computer to be
utilized for sharing the load at a busy site, and, similarly, reliability is
improved because other processors can be utilized to take up the load of a
failed one until it is repaired.

All of these benefits are particularly welcome in C3I applications. Major new
techniques are being introduced in several functional areas, and the
flexibility offered by distributed systems is highly desirable to support a
smooth transition to these. The performance advantages are valuable because
of the high ratio of peak load to average load, and the resulting oversizing
(in terms of average load) that is necessary if dedicated computers are used
at each site. The reliability advantages of distributed systems translate
directly into survivability, perhaps the most highly prized attribute in a C3I
system.

A number of other RADC projects address architectural aspects of distributed
systems and configurations for specific applications. The present study is
particularly concerned with those aspects of reliability, fault tolerance, and
testability that are applicable to broad classes of systems. The definitions
and classifications of systems presented here, the experience on current
systems, and the techniques described in this report will be analyzed and
integrated (together with additional information) during the remainder of this
study. The final report will constitute a guideline for achieving
reliability, fault tolerance, and testability in the design of distributed
systems for C3I applications.

Section 2 of this report introduces the terminology for distributed systems
and fault tolerance, presents classification schemes (taxonomies) for both of
these concepts, and, in the final part of the section, discusses the
objectives and problems in achieving fault tolerance in distributed systems.

Section 3 deals with the reliability of current systems that employ components
or techniques that will be applicable to distributed systems in the future.
In none of the instances for which data are presented do present systems meet
all of the criteria for a truly distributed system that are described in
Section 2. Nonetheless, the examination of the current data is essential
because it is the basis from which planning for the future must proceed. Owing
to the cooperation received from a number of Government and private
organizations, the long-term (mostly one year) reliability experience of ten
systems is presented in a consistent format, with allocation of failures to
hardware, software, and other causes. This collection of data may also be of
interest to readers outside the field of distributed systems. The final part
of Section 3 presents a preliminary interpretation of these data for future
C3I systems.

Section 4 discusses design issues and methods in distributed systems. That
part of the report is primarily intended to define the constraints within
which guidelines for fault tolerance, reliability, and testability must be
developed. The motivation of the developer/user, the current state of
supporting technologies (primarily in networking), and the approaches of
established design methodologies are reviewed. The contributions made in the
development of specific distributed systems are summarized in the final part
of that section.

Section 5 contains an example of a formal fault location technique for several
configurations that were repeatedly encountered in current military systems
that employ distributed processing of modest scope (a limited number of
computers). Fault location is a significant aspect of the testability of
distributed systems. Prof. K. H. Kim, a consultant to SoHaR on this effort,
originated the concepts used in Section 5 and generated the program design for
fault location that is presented in the Appendix. Sections 4 and 5 constitute
examples of individual issues and techniques that must be mastered for the
successful application of fault tolerance in distributed systems.




SECTION  2  —  FAULT  TOLERANCE  AND  DISTRIBUTED  SYSTEMS 


This section introduces basic concepts of fault tolerance and distributed
systems as a foundation for the remainder of the report. Section 2.1 defines
key terms that are used throughout the report, section 2.2 contains a taxonomy
of fault tolerance measures applicable to distributed systems, and section 2.3
describes the classification of distributed systems. Finally, section 2.4
describes functional needs in fault tolerance and distributed systems.


2.1.  DEFINITION  OF  KEY  TERMS 

2.1.1.  Error,  Failure,  and  Fault 

The terms error, fault, and failure are often used interchangeably in
technical literature. However, with the introduction of systems that continue
to operate when components cease to perform as specified, distinctions among
various levels of failures, causes, and effects become necessary. The
definitions of these terms used in this report are shown in figure 2-1.

An error exists when the output of a computer system does not meet user
requirements, or when the computer is in a state that does not support user
needs. The system itself has failed, i.e., execution of a program on this
system has resulted in a failure. To cause this failure, a fault must have
been present in either the hardware or the software. Hardware faults are
frequently caused by deterioration of initially fault-free devices. Because
random processes contribute to the deterioration, these hardware faults are
said to produce random failures. Software faults, as well as hardware design
faults, have been present from the time the system was placed into service.
They have not resulted in observed errors because of lack of observation or
because the external event or trigger to activate them had not been present.

2.1.2.  Hardware  Fault  Tolerance 

Hardware fault tolerance is the ability of the system hardware to continue a
specified level of operation in the presence of one or more hardware faults.
This ability is most often achieved by the use of replicated components. The
definition implies that the system must continue to function as specified for
all inputs; thus, a system capable of operating in a degraded mode in which a
restricted set of inputs can be processed is not fault tolerant.
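
One common way to exploit replicated components is to compare their outputs
and accept the majority value. The sketch below is illustrative only: it
assumes three replicas with a majority voter, a scheme this report does not
prescribe, and the names majority_vote, healthy_replica, and faulty_replica
are hypothetical.

```python
from collections import Counter


def majority_vote(outputs):
    """Return the value produced by a strict majority of the replicas.

    Raises RuntimeError when no value has a strict majority, i.e. when the
    replicated unit can no longer mask the faults that are present.
    """
    value, count = Counter(outputs).most_common(1)[0]
    if count * 2 <= len(outputs):
        raise RuntimeError("no majority among replica outputs")
    return value


def healthy_replica(x):
    return x + 1


def faulty_replica(x):
    # Hypothetical hardware fault: the replica's output is stuck at zero.
    return 0


# Three replicas of the same function, one of them faulty (an assumption).
replicas = [healthy_replica, faulty_replica, healthy_replica]
masked_result = majority_vote([replica(41) for replica in replicas])
```

With one faulty replica out of three, the voter still delivers the specified
output for every input, which is the property the definition above requires.
A second fault that breaks the majority is reported rather than masked.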

2.1.3.  Software  Fault  Tolerance 

Software fault tolerance is the ability of a system to provide uninterrupted
operation in the presence of program faults through multiple implementations
of a given functional process. Although this definition was first proposed by
Elmendorf more than a decade ago [ELME72], there remains much inconsistency
in the usage of this term in the software engineering literature. For
example, other techniques such as fault containment and robustness have also
been characterized as fault tolerant despite the fact that they do not provide
for alternate and independent execution of a function.
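
A minimal sketch of "multiple implementations of a given functional process"
follows. It assumes a recovery-block style arrangement in which independent
versions are tried in order and each result is checked by an acceptance test.
That arrangement is an assumption made for illustration, and every name in
the block is hypothetical.

```python
import math


def run_with_alternates(implementations, acceptance_test, argument):
    """Try independent implementations in order; return the first accepted result."""
    for implementation in implementations:
        try:
            result = implementation(argument)
        except Exception:
            continue  # a crash in one version is treated like a rejected result
        if acceptance_test(argument, result):
            return result
    raise RuntimeError("all implementations failed the acceptance test")


def primary_sqrt(x):
    # Hypothetical program fault: halves the input instead of taking the root.
    return x / 2


def alternate_sqrt(x):
    return math.sqrt(x)


def sqrt_is_plausible(x, root):
    # Acceptance test: squaring the result must give back the input.
    return math.isclose(root * root, x, rel_tol=1e-9)


root = run_with_alternates([primary_sqrt, alternate_sqrt], sqrt_is_plausible, 9.0)
```

The faulty primary version is rejected by the acceptance test and the
alternate supplies the result, so the caller sees uninterrupted operation.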

FIGURE 2-1  BASIC FAILURE MODEL


2.1.4. Distributed Systems

The term "distributed" when used in conjunction with "processing" or "system"
has become one of the vaguest terms in the field of computing. In an
introduction typical of many papers on distributed processing, Enslow [ENSL78]
spoke of the problem as follows:

"Words have only one purpose in a technical context -- the transmission
of information. When they fail to do that, they lead to confusion and
misunderstanding. 'Distributed data processing' and 'distributed
processing' illustrate that axiom. Like many other words in the lexicon
of the computer professional, these have become cliches through overuse,
losing much of their original meaning in the process."

Since the publication of that article, an increasing number of vendors, system
analysts, and users have adopted the term with a resultant further corruption
in its meaning. Thurber's [THUR80] definition of a distributed processing
system, which shall be used in this report, consists of a set of six
conditions:

1. The system has at least two processors (processing elements, hosts,
etc.).

2. Each processor has a main storage module and other memory subsystems
as required.

3. There is no system-wide shared memory.

4. There is a communications medium termed the "communications
subnetwork".

5. All process communication occurs via messages between processors over
the communications subnetwork.

6. A message is modeled as a stream of bits broken into three major
sections: header, information text, and trailer.
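
Conditions 5 and 6 can be sketched as a small message type. The report only
states that a message has a header, an information text, and a trailer. The
field contents below (processor ids in the header, a CRC-32 in the trailer)
and all function names are assumptions chosen for illustration.

```python
import zlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Message:
    header: bytes   # assumed here to carry source and destination processor ids
    text: bytes     # the information exchanged between processes
    trailer: bytes  # assumed here to carry a CRC-32 over header and text


def build_message(source, destination, text):
    header = bytes([source, destination])  # ids must fit in one byte each
    trailer = zlib.crc32(header + text).to_bytes(4, "big")
    return Message(header, text, trailer)


def to_bit_stream(message):
    """Serialize the three sections in order: header, text, trailer."""
    return message.header + message.text + message.trailer


def parse_bit_stream(stream):
    """Split a received stream into its three sections and verify the trailer."""
    header, text, trailer = stream[:2], stream[2:-4], stream[-4:]
    if zlib.crc32(header + text).to_bytes(4, "big") != trailer:
        raise ValueError("trailer check failed: message corrupted in transit")
    return Message(header, text, trailer)


sent = build_message(source=1, destination=2, text=b"track update")
received = parse_bit_stream(to_bit_stream(sent))
```

Because there is no shared memory (condition 3), everything the receiving
processor learns arrives in such a stream, so a check carried in the trailer
is one plausible place to detect link faults.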

This definition was chosen after examining some of the major Air Force C3I
systems in which dispersed computers perform associated functions and are
linked with various types of communication lines. The dispersed computers may
be grouped into tightly coupled networks in which one or more mainframe
computers control an array of sensors, displays, or other devices. Fault
tolerance measures can be applied both to the links between these computing
centers and within the centers themselves.

Most currently operational large distributed systems consist of a "main"
computer installation and "satellite" nodes. The main computer installation
contains one or more computers which collectively control the network.
Failure of the main computer will result in either a severely degraded or
nonfunctional network consisting only of satellite nodes. Each satellite node
may have one or more computers which control local (i.e., not connected with
other satellites) input and output devices. Failure of one or more of these
[...] network. Thus, failure of any single node results in the loss of some
system processing capability, but does not necessarily result in a total system
failure. Each node on a decentralized system may itself be a centralized
distributed system. For example, the ARPANET consists of a large number of
mainframe computers which control an extensive local network of satellite
minicomputers, intelligent terminals, and output devices. None of these
nodes, however, controls any other node on the system.

This definition is functional for describing current Air Force distributed
systems from the scale of fighter aircraft avionics to the scale of the WWMCCS
network. It is also consistent with the current use of the term in the
commercial computing industry. Finally, networks which conform to more
limited definitions of distributed processing such as [ENSL78 and ENSL81] are
also included in this definition.


2.2.  TAXONOMY  OF  FAULT  TOLERANCE  MEASURES  FOR  DISTRIBUTED  SYSTEMS 


Taxonomies for the classification of both fault tolerance methods and network
architectures are necessary to partition the topic of fault tolerance in
distributed systems into homogeneous and manageable subtopics. The objective
of this section is to develop a framework for classifying fault tolerance
measures for distributed systems. The basis of this taxonomy is the
conceptualization of a computer network as nodes and links. The node is
defined as everything on the computer side of the I/O buffer, and the link is
defined as the network system beyond the I/O buffer until that of the next
node.

Figure 2-2 shows the taxonomy. Fault tolerance for distributed systems can be
implemented either with or without reconfiguration of the network. The left
hand side of the tree shows fault tolerance implementation with
reconfiguration consisting of node substitution, link substitution, or both.
Reconfiguration is the highest level of fault tolerance for a distributed
system, and requires a network management system. Commercially available
protocols and network architectures allow for the reconfiguration (i.e., the
disconnection or reconnection while the rest of the network remains
unaffected) of secondary network processing elements. However, most work on
network reconfiguration after failure of principal processing nodes has been
on either a theoretical level or on experimental systems.

The right hand side of the tree shows how fault tolerance is applied without
network reconfiguration. In this case, recovery after a failure is achieved
by returning all nodes and links to an operational status. Restoration of
communication through a link is achieved by one of several strategies: time
based (e.g. NAK/ACK) for transient failures, alternate types of communication
links for longer term faults (e.g. use of optical communications in the event
of extended electromagnetic disturbances), or the use of an alternate route of
the link (e.g. a replicated bus running along different sides of an aircraft
to mitigate the effects of battle damage). The restoration of a node is
achieved through computer hardware or software fault tolerance techniques.


FIGURE 2-2  TAXONOMY FOR FAULT TOLERANCE IN DISTRIBUTED SYSTEMS


This division of fault tolerance measures provides a framework for further
discussion and analysis, but should not be interpreted as meaning that use of
techniques in one classification prevents use of techniques in another class.
For example, a fault tolerant computer must have both hardware and software
fault tolerance. Similarly, achieving fault tolerant communication links may
involve time, type, and space tactics, and may also include use of an
alternate link as an additional backup measure.
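
The way time-based and alternate-link tactics can be combined is sketched
below. This is an illustration under stated assumptions, not a protocol from
the report: each link is modeled as a function that returns True for ACK and
False for NAK or timeout, the retry limit is arbitrary, and all names are
hypothetical.

```python
def send_with_restoration(message, links, retries_per_link=3):
    """Return the name of the link that delivered the message.

    links is an ordered list of (name, send) pairs; send(message) returns
    True for ACK and False for NAK or timeout.
    """
    for name, send in links:
        for _ in range(retries_per_link):  # time based: repeat on the same link
            if send(message):
                return name
        # Retries exhausted: treat as a longer term fault and try the next link.
    raise ConnectionError("no link could deliver the message")


def make_flaky_link(failures_before_success):
    """Hypothetical link that NAKs a fixed number of times, then ACKs."""
    remaining = [failures_before_success]

    def send(message):
        if remaining[0] > 0:
            remaining[0] -= 1
            return False
        return True

    return send


links = [
    ("primary bus", make_flaky_link(5)),          # fault outlasts the retries
    ("alternate-route bus", make_flaky_link(1)),  # one transient NAK
]
used = send_with_restoration(b"status", links)
```

Here the primary bus never acknowledges within its retry budget, so the
message is delivered over the alternate route after one further transient NAK.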


2.3.  TAXONOMIES  OF  DISTRIBUTED  SYSTEMS 

The taxonomy used in this work on reliability, maintainability, and fault
tolerance characteristics of distributed computer systems was tailored to
operational Air Force systems and analogous non-military systems. It utilizes
a small number of categories that are well defined and within each of which
uniform reliability problems are found and solutions can be applied. Section
2.3.1 describes other taxonomy schemes in the literature and explains why they
are unsuitable for the purposes of this work. Section 2.3.2 uses the
hierarchical model of network architecture to define two taxonomies. Section
2.3.3 describes the lower level taxonomy (designated as the "network"
taxonomy), and section 2.3.4 describes the upper level "application" taxonomy.


2.3.1.  Earlier  Taxonomies  of  Distributed  Systems 

The taxonomy one adopts for distributed systems is determined by the technical
point of view. Indeed, so many taxonomies of distributed systems have been
presented in the literature that it is possible to develop a classification
scheme for the taxonomies [GREE77, BANN81].

Most taxonomies take a topological approach by defining primitives (e.g.
nodes, switches, and links) and then classifying the ways in which they can be
linked together. The scheme most often cited in the literature using this
approach is that of Anderson and Jensen [ANDE75]. The topological view is
problematic because other aspects of the network can have more impact on the
system characteristics. For example, the computing system of a major Los
Angeles newspaper and that of a C3I installation both consist of two
replicated mainframe computers and two front end processors. Although these
systems are topologically similar, they are very different in most other
aspects.

Authors such as Thurber [THUR78] and Jensen et al. [JENS76] use switching
methods (i.e., no switching, circuit switching, message switching, and packet
switching) in their classification schemes. This approach fails to consider
differences in local and long haul computer networks, and also fails to
consider topological and operational aspects of a network. For example, the
Ethernet [SHOC82] is topologically a linear bus system (i.e., a variety of
nodes are connected to a single communications channel) in which there are no
discrete switching elements. Thus, one might place this network in the no
switching classification. However, the Xerox implementation of a network
using Ethernet is based on fixed length message packets, and it is therefore
often characterized as a packet switching network.

Other authors have attempted to address the many aspects of distributed
systems by developing elaborate taxonomies. One such classification scheme
has five levels and more than sixty categories [BANN80]. Unfortunately, the
complexity of this approach makes it impractical.


2.3.2.  Taxonomy  Used  for  This  Study 

The taxonomy for this study regards distributed system architectures as a
series of layers, a concept which has been prominent since the development of
ARPANET in the late 1960s [KLEI78], [ISO81]. The top layers include the
application program and associated display terminal (or other I/O device)
control characters. Intermediate levels interface the applications program to
the host computer and the intercomputer communication system (often designated
the "subnetwork"). The bottom layers control node access to the communication
subnetwork and actual data transmission.

Figure 2-3 shows the seven level ISO Reference Model [ISO81] which is the
basis of much of the current work in distributed systems. The dotted lines
between the levels at the two nodes demonstrate the notion of transparency,
i.e., viewing each level as communicating directly with its counterpart at the
receiving node without regarding the intervening levels through which the data
actually pass during the transfer. The desired result is to decouple
application programs and data from lower levels which control the actual
operation of the network. Figure 2-4 shows how these layers are aggregated
for this study. Reliability, maintainability, and fault tolerance
characteristics of each grouping of layers will be considered separately,
with the resultant development of two corresponding taxonomies.
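
The notion of transparency can be illustrated with a toy model in which each
level adds its own header on the way down and its counterpart removes it on
the way up. The layer names follow the commonly cited ISO model. Representing
headers as bracketed string prefixes, and the function names themselves, are
simplifying assumptions.

```python
# The seven levels, listed from the top of the stack to the bottom.
LAYERS = ["application", "presentation", "session", "transport",
          "network", "data link", "physical"]


def send_down(payload):
    """Wrap the payload in one header per level, top level first."""
    frame = payload
    for layer in LAYERS:
        frame = f"[{layer}]{frame}"
    return frame


def receive_up(frame):
    """Strip the headers in reverse order, bottom level first."""
    for layer in reversed(LAYERS):
        prefix = f"[{layer}]"
        if not frame.startswith(prefix):
            raise ValueError(f"missing {layer} header")
        frame = frame[len(prefix):]
    return frame


on_the_wire = send_down("position report")
delivered = receive_up(on_the_wire)
```

The application at the receiving node sees exactly what the sending
application handed over. The headers added by the lower levels never reach
it, which is the decoupling the layered model aims for.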

Because the lower layers of figure 2-4 are affected by the nature of the
network (e.g. local or long haul, transmission rate and medium, performance
characteristics of the node hardware, etc.), the taxonomy for this level is
designated as the "network" taxonomy and is discussed in subsection 2.3.3.
The upper layers are related to the particular applications tasks and data,
and the associated taxonomy is therefore designated the "application level"
taxonomy, which is described in subsection 2.3.4.


2.3.3. Network Taxonomy

Figure 2-5 shows the overall structure of the network taxonomy. The left
branch includes three types of networks confined to a small physical area and
the right branch describes two types of dispersed networks. The distinction
between these branches is the ratio of the communication link bandwidth to the
computing node processing throughput, a quantity which governs the efficiency
with which computing nodes can interact. In localized networks, where links
have capacities in the megabit per second and higher range, this ratio is
generally on the order of a few percent. In dispersed networks, where long
distance links normally have capacities of less than 20,000 bits per second,
it is several orders of magnitude lower.
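
The ratio can be worked through with illustrative numbers. None of the
figures below come from the report: the node throughput of 300 million bits
per second, the 10 Mbit/s local link, and the 9600 bit/s long distance link
are assumptions chosen only to fall within the ranges quoted above.

```python
import math


def bandwidth_ratio(link_bits_per_s, node_bits_per_s):
    """Ratio of link bandwidth to node processing throughput."""
    return link_bits_per_s / node_bits_per_s


NODE_THROUGHPUT = 300e6  # assumed node processing throughput, bits per second

local = bandwidth_ratio(10e6, NODE_THROUGHPUT)      # assumed 10 Mbit/s local link
dispersed = bandwidth_ratio(9600, NODE_THROUGHPUT)  # assumed 9600 bit/s long distance link

# How many powers of ten separate the two ratios.
orders_lower = math.log10(local / dispersed)
```

Under these assumptions the localized ratio is about 3 percent and the
dispersed ratio is roughly three orders of magnitude lower. Note that the
node throughput cancels out of the comparison, so the gap between the two
branches depends only on the link capacities.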


[FIGURE 2-?: NETWORK ... (caption illegible)]


The three subclassifications of the localized computer networks shown in
figure 2-5 are mobile, collocated, and proximity computer networks. The
latter designation refers to a network whose nodes are located within a radius
of approximately 5 km (the predominant definition of local computer networks
for office automation purposes).

Because of size and weight limitations, mobile localized networks generally
consist of mini- and microcomputers. The distinguishing features of this
category are the limited maintenance and diagnostic resources which may be
applied during operation, timing and program length constraints, and the time
critical nature of interruptions and recovery procedures. The major
operational example of an Air Force system in this classification is AWACS. A
developmental microprocessor based system currently exists at Wright Patterson
Air Force Base [LARI81]. Network reliability problems include failure
detection, isolation, and reconfiguration due to either component malfunctions
or battle damage. These functions must be performed automatically, because
human operators may be fully occupied with other tasks, and rapidly,
because of the real time applications.

Networks of collocated computers, the second subclassification, are fixed
ground based computers interconnected to achieve higher system reliability,
throughput, or task integration. Systems in this category are distinguished
from the previous one by fixed locations and the resultant relaxing of
constraints on component weight and size, diagnostic provisions, and
maintenance capabilities.

Reliability problems for this category of distributed systems include local
failure detection, isolation, and reconfiguration. In most cases, links
between the computers do not contribute significantly to the network failure
rate.


Distributed systems in the proximity subclassification are ground based
networks in which nodes are located in the same general vicinity but are not
physically adjacent. These systems are currently designated as "local area
networks". The major distinctions of this category are the internode distance
and the use of serial (rather than parallel) communication on the links. Two
examples of Air Force systems are the FILAN specification now being developed
at RADC [FILA82] and Ballistic Missile Defense (BMD) systems now being
developed by the U.S. Army [ALFO81]. A third example is the combination of
computers on an AWACS aircraft, on fighter aircraft being controlled by the
AWACS, and at a ground station. Network reliability problems for these
systems include remote failure detection and isolation (the malfunctioning
node may be inaccessible because of distance or battle considerations),
reconfiguration, and disconnection of a "babbling node". As a consequence of
the increased internode distance, problems on the link related to noise and
signal propagation time must also be considered.

Figure 2-5 shows two subclassifications for dispersed computer networks: (1)
dispersed computers, i.e., a network with large computers scattered over a
wide geographical area, and (2) dispersed terminals, i.e., a network with one
or more computers located at a central site which support sensors, terminals,
and other specialized devices over a wide geographical area.




The prime military example of the first subclassification is the Worldwide
Military Command and Control System (WWMCCS), a system which includes sites in
Europe, North America, the Pacific, and Asia [GAO78]. Other example systems
which were surveyed are shown in table 3-1. Network reliability problems
include ensuring the integrity of communication links to other computers,
error detection and correction of the transmitted data, remote failure
detection and isolation of both computers and communication links, and
establishment of alternate links to disconnected nodes.

The distinguishing characteristic of the second subclassification of
dispersed systems is the presence of geographically separated terminals (and
other I/O devices) and a central computing facility. If the computing
facility contains more than one local computer, that part of the network falls
into the localized classification while the portion concerned with the remote
terminals falls in this category. Military examples of such systems include
NORAD and PAVE PAWS [GAO78], which have both multiple collocated mainframe
computers and links to remote sites. Network reliability problems include
ensuring the integrity of the communications link to the terminals, error
detection and correction of transmitted data, remote failure detection and
isolation of communication links and terminals, and rerouting of
communications to critical disconnected terminals.

2.3.4.  Application  Level  Taxonomy 

Figure 2-6 shows the taxonomy for the application level. The two major
divisions are based on the need for shared programs and data among two or more
computing nodes. The left branch of the taxonomy comprises those applications
which do not involve shared programs or data. The right branch consists of
two classes of shared programs or data: replication and partitioning.
Partial replication is a special case of partitioning.

The primary military example of a computer network falling into the unshared
subclassification is a single AWACS aircraft. The navigation, communication,
display control, and central computers perform unrelated tasks and, although
the first three computers interface with the fourth, there are no common
programs or data. A second military example is the NORAD computer complex
[GAO78] in which three separate computer systems perform distinct but
interrelated functions. Network reliability problems in this category are
task scheduling after reconfiguration (if it is possible to reallocate tasks
from a failed node onto working nodes), network recovery on the application
level, and interprocess communication for both co-resident tasks and those
resident on different processors.

PAVE PAWS is the best example of a computer network in which shared programs
and data are replicated. Network reliability problems at the applications
level include updating procedures and concurrency control, network recovery,
and interprocess communication. Because of the complexity of implementing
partitioned distributed programs and data bases, no examples were included in
this study. Network reliability problems for partitioned systems are
concerned with interprocess communications, concurrency control (i.e.,
assuring that a READ request is honored only after all earlier WRITEs are
performed), program and data redundancy measures, and network recovery.
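
The concurrency control rule stated above (a READ is honored only after all
earlier WRITEs are performed) can be illustrated with a minimal sketch. The
class name ReplicaStore, the use of consecutive sequence numbers to define
"earlier", and the buffering of out-of-order WRITEs are assumptions made for
illustration; the report does not prescribe a mechanism.

```python
from dataclasses import dataclass, field


@dataclass
class ReplicaStore:
    """One copy of a shared data item; applies WRITEs strictly in sequence order."""
    applied_seq: int = 0                          # highest WRITE applied so far
    value: object = None
    pending: dict = field(default_factory=dict)   # out-of-order WRITEs by sequence number

    def write(self, seq: int, value: object) -> None:
        """Accept WRITE number `seq`; apply it and any buffered successors in order."""
        self.pending[seq] = value
        while self.applied_seq + 1 in self.pending:
            self.applied_seq += 1
            self.value = self.pending.pop(self.applied_seq)

    def read(self, issued_after_seq: int):
        """Honor a READ only if every WRITE up to `issued_after_seq` has been applied.

        Returns (True, value) when the READ can be honored, (False, None) otherwise.
        """
        if self.applied_seq >= issued_after_seq:
            return True, self.value
        return False, None


store = ReplicaStore()
store.write(2, "track-B")          # WRITE 2 arrives before WRITE 1 and is buffered
early = store.read(2)              # refused: WRITE 1 has not been performed yet
store.write(1, "track-A")          # WRITE 1 arrives; WRITEs 1 and 2 are applied in order
late = store.read(2)               # honored: all earlier WRITEs are performed
```

A refused READ would in practice be queued or retried; the sketch only shows
the ordering test itself.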


[Figure not reproduced; legible branch labels: REPLICATED, PARTITIONED]


FIGURE 2-6  APPLICATION LEVEL TAXONOMY


2.4  FUNCTIONAL  NEEDS  FOR  FAULT  TOLERANCE  IN  DISTRIBUTED  SYSTEMS 


This section describes typical Air Force C3I needs both in distributed
computer systems and for fault tolerance in these systems. Because the aim of
this study is to advance the state of the art, this section emphasizes needs
that are not currently being met. However, it should be noted that some fault
tolerant capabilities in distributed computing do exist at present.

2.4.1  Functional Needs in Distributed Computing

There is a pervasive need in C3I applications to access programs and data from
remote files or real-time data sources, to combine these with local programs
and data, and to cause actions to be taken on output derived from the combined
data. A typical case is the identification of the launch point of an incoming
missile from (a) a file of potential launch sites that may be in a local data
base and (b) track information that is coming in from one or more remote
sites, and then to notify affected commands of the results of this
identification. When data sources have been selected in advance, this
computation can proceed without operator involvement, and the results placed
on hardwired communication links. However, when the situation demands a more
general solution, it is desirable that an operator on a properly privileged
terminal be able to set up an equivalent computation by means of the computer
statements shown in figure 2-7 (A).


Ideally, the operator need only identify the desired procedure (DATAMERGE),
the type of data (A and B), and the disposition of the output (creation of a
file MERGED). The distributed computer system will then (1) select the most
suitable and available computer for this procedure, (2) access the most
current sources for data A and B, and (3) store the resultant file in the most
accessible device. This capability is not implemented in currently
operational systems.
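
The three selection steps can be sketched as follows. This is an illustration
under stated assumptions, not a description of any fielded system: the record
types Computer, DataSource, and StorageDevice, the function plan_request, and
the selection criteria (lowest load, latest update, most free space as a
stand-in for "most accessible") are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Computer:
    name: str
    available: bool
    load: float          # fraction of capacity in use, 0.0 to 1.0


@dataclass
class DataSource:
    site: str
    item: str            # data type, e.g. "A" or "B"
    last_update: int     # larger means more recent


@dataclass
class StorageDevice:
    name: str
    up: bool
    free_blocks: int


def plan_request(item_names, result_blocks, computers, sources, devices):
    """Choose a computer, the most current source per data item, and a storage device."""
    candidates = [c for c in computers if c.available]
    if not candidates:
        raise RuntimeError("no computer available")
    computer = min(candidates, key=lambda c: c.load)       # (1) least loaded available computer

    chosen_sources = {}
    for item in item_names:                                 # (2) most current source per item
        matches = [s for s in sources if s.item == item]
        if not matches:
            raise RuntimeError(f"no source for data {item}")
        chosen_sources[item] = max(matches, key=lambda s: s.last_update)

    usable = [d for d in devices if d.up and d.free_blocks >= result_blocks]
    if not usable:
        raise RuntimeError("no storage device can hold the result")
    device = max(usable, key=lambda d: d.free_blocks)       # (3) assumed proxy for accessibility
    return computer, chosen_sources, device
```

With such a planner the operator request reduces to naming the procedure, the
data types, and the output file; every binding to a specific machine or device
is made by the system at the time of the request.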

Instead, the operator is forced to select a computer, to identify sources for
the programs and data, and to tell the system where to store the result as
indicated in figure 2-7 (B). In routine situations these operator actions are
trivial, and a strong argument can be made that the ability of current systems
to automate the access (item 2 in the previous paragraph) is a major
achievement. However, what if the routinely programmed computer for this
procedure is already fully loaded, the routinely accessed programs and data
sources have not been updated (but another source has been), and the routine
storage device is down or does not have sufficient capacity for this file?
All of these difficulties are much more likely to arise in exactly those
situations when C3I systems must perform 'for real'.

Therefore, a substantial incentive exists for achieving the capabilities of
figure 2-7 (A). A major problem is the tendency of present support software,
particularly the compilers, to bind an application to a specific computer.
Typical application programs can only be run on one specific type of computer
after compilation into object code as indicated in figure 2-8 (A). Even a
routine modification such as adding memory will require recompilation in many
cases. In order for a distributed system to assign application programs to
any available computer, it is necessary to separate those portions of the
compilation which translate source code from those that provide the computer
adaptation as shown in figure 2-8 (B). While there are tendencies in that
direction, much more effort seems necessary to meet the functional needs of
C3I users.

2.4.2  Functional  Needs  for  Fault  Tolerance 

An analogous situation to that of distributed programs and data exists for
fault tolerant features in distributed systems. Ideally, the operator should
be able to indicate simply that fault tolerance (or perhaps a specific degree
of fault tolerance) is desired, and the computer system should then configure
itself to provide the required back-up elements as implied by the instructions
shown in figure 2-9 (A). However, the best available technology toward this
end is in fixed redundant installations with the ability to automate back-up
programs and data storage (not always efficiently). In case of a computer
failure, the user must select the alternate, purge files that may contain
improper programs and data, and identify a suitable restart point as indicated
in figure 2-9 (B).

There are few specific obstacles to achieving the desired fault tolerance
capabilities once the problem of assigning suitable alternate computers has
been solved. Thus, an improvement in the functional capabilities relative to
distributed computing will pave the way for a significant improvement in
practical fault tolerance.


SECTION 3 - RELIABILITY OF CURRENT SYSTEMS


The reliability experience on current systems represents a starting point
for what might be expected for future systems and for determining the types
of reliability improvements that would be most effective for these. The
first part of this section presents data on ten current systems, the second
part analyzes the data, and the third part evaluates the outlook for future
systems based on the current experience.

3.1  CURRENT  EXPERIENCE 

As part of this study, reliability and availability data on ten current
systems were obtained in a consistent format. All of these systems serve
applications in which it is important that computer services be continuously
available throughout a specified portion of the day, in some cases for 24
hours, and therefore all of them incorporate redundancy for at least a
portion of the local computer installation. None of them uses resources at
another node to substitute for failed or overloaded local resources, and in
this regard they are not representative of the operation of future
distributed systems. Expectations about the reliability of distributed
systems are derived as extrapolations from the experience discussed here and
are presented in the last part of this section.

The systems for which reliability data were obtainable span a wide range of
applications, from telephone switching systems to airline reservations and
banking. The systems are not comparable in terms of complexity. In
particular, the diversity of tasks handled by the FAA en route air traffic
control system makes this a uniquely complex application area. In some
cases availability or reliability goals had been established whereas in
others it was intended to provide the best service possible. Any grouping
is somewhat arbitrary, and comparisons between systems must take into
account the wide differences in requirements, development and procurement
constraints, and operational practices. The data are presented to show
that:


a.  availability data are being collected in a consistent format in a
variety of applications

b.  several systems are available more than 99% of their expected
operating time

c.  the causes of failures are fairly similar, and are distributed
about evenly between hardware, software, and other classifications.

Four of the systems exist in almost identical form in many locations, whereas
the others are singular installations. Table 3-1 shows the downtime and
related data for systems that are installed in multiple locations.


TABLE 3-1  EXPERIENCE OF MULTIPLE INSTALLATION SYSTEMS

                     Bell         FAA            Federal Reserve Bank
System               No. 4 ESS    En Route ATC   Dataphone 50   Medium Speed

No. installed        55           20             13             14
Op. hrs/yr           8760         7665           3000           3000
Availability goal    99.99%       -              96%            98.5%
Actual availability  99.99%       99.6%          99%            98.8%
Downtime
  Avg. hrs/yr        0.75         30             30.3           34.9
  Caused by
    Hardware         25%          40%            38%            39%
    Software         35%          30%            35%            51%
    Other            40%          30%            27%            10%


The Bell No. 4 Electronic Switching System is intended to operate 24 hours
every day of the year; the En Route Air Traffic Control System is shut down
for maintenance approximately 3 hours each day during the early morning hours
and a back-up system is then used to handle the light traffic load; both
Federal Reserve Funds Transfer Systems operate approximately 12 hours a day on
weekdays only.

Several availability requirements have been established for the No. 4 ESS.
One of these is that the average downtime for an installation shall not exceed
6 hours over a 40 year operating life (corresponding to an availability of
99.9983%). Other requirements are specific to the application, dealing with
the number of calls that may be interrupted and with the number of unit
replacement actions [DAVI81]. The availability requirement for the FRB Funds
Transfer System relates to the availability of each installation for a given
month. The actual availabilities are in each case averages over all
installations for a calendar year. The Bell and FAA actual availability data
are for 1980; the FRB actuals are for 1981.
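
The relation between downtime and availability used here can be checked with
a short calculation. The sketch assumes availability is defined as one minus
the ratio of downtime to scheduled operating time, which reproduces the
99.9983% figure for the ESS requirement; the function name availability and
the row labels are illustrative. The inputs are the downtime and operating
hours reported in Table 3-1.

```python
HOURS_PER_YEAR = 8760


def availability(downtime_hours: float, operating_hours: float) -> float:
    """Fraction of scheduled operating time during which the installation was up."""
    return 1.0 - downtime_hours / operating_hours


# No. 4 ESS requirement: at most 6 hours of downtime over a 40 year operating life.
ess_requirement = availability(6, 40 * HOURS_PER_YEAR)
print(f"ESS requirement: {100 * ess_requirement:.4f}%")

# Table 3-1 rows: (average downtime in hrs/yr, operating hrs/yr).
table_3_1 = {
    "Bell No. 4 ESS": (0.75, 8760),
    "FAA En Route ATC": (30, 7665),
    "FRB Dataphone 50": (30.3, 3000),
    "FRB Medium Speed": (34.9, 3000),
}
computed = {name: availability(down, op) for name, (down, op) in table_3_1.items()}
for name, value in computed.items():
    print(f"{name}: {100 * value:.2f}%")
```

The computed values agree with the "actual availability" row of Table 3-1 to
the precision reported there.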

Criteria for downtime are that the entire installation becomes inoperative for
more than a threshold amount of time (in the case of the FAA this is one
minute; it is less for the other systems in Table 3-1). Partial outages that
affect only a limited number of phones, or a single controller's console, are
not included in these statistics. Note particularly that failure of a single
computer will typically not result in downtime because back-up is available.

The availability experience of six systems that exist in only a single
installation is presented in Table 3-2. Two organizations contributed data on
two systems each. In one case the two systems used exactly the same hardware
configuration but differed in operational details; in the other case there was
only gross similarity of equipment. The availability goal for each of the
airline systems was 99.6%. Goals for the other installations were not stated.
The computer applications represented in Table 3-2 differ greatly in the
complexity of the programs, size of data bases, number of access points, and
requirements for real-time output. Differences in availability or downtime
therefore do not indicate that one system is "better" than another. The data
for the airline systems pertain to 1980. All other data represent 1981
experience.


TABLE 3-2  EXPERIENCE OF SINGLE INSTALLATION SYSTEMS

                     Airline               Military             Stanford      Gov't
System               Flt. Inf.  Reserv.    Syst. A   Syst. B    Lin. Accel.   Fiscal Syst.

Op. hrs/yr           8760       8760       7835      8630       8518          6535
Actual availability  99.89%     99.65%     99.22%    98.40%     98.66%        88.52%
Downtime
  Hrs/yr             9.5        31         61        138        114           750
  Caused by
    Hardware         23%        41%        28%       49%        56%           64%
    Software         16%        35%        10%       1%         35%           14%
    Other            61%        24%        62%       50%        9%            22%


3.2  ANALYSIS  OF  CURRENT  DATA 

Even the most casual review of the data presented in Tables 3-1 and 3-2
identifies the Bell No. 4 Electronic Switching System as having exceptionally
high availability and correspondingly low downtime. This system is the
product of a specialized organization comprising several thousand
professionals, and, as the designation indicates, it is the fourth major
design of an electronic switching system undertaken by that group.
Publications on the No. 1 ESS go back at least to 1964 [KEIS64], and features
of the No. 4 ESS were described as early as 1972 [YAUG72]. The data in Table
3-1 indicate that the long-term allocation of resources to ambitious and
well-specified reliability goals produces the desired results.

Like its predecessors, the No. 4 ESS incorporates dual digital processors and
error correcting code in memory. Redundancy is incorporated in peripherals
such that a single failure cannot disable more than a small number of lines.
In 1980 the average installation served 22,000 terminations. The computer
program comprised over 2 million instructions [DAVI81].

In the early reliability planning for electronic switching systems it was
assumed that most system failures would be caused by simultaneous failures of
redundant hardware components, such as a second processor failing while the
first one was being repaired. Such incidents accounted for only 11% of all
failures and 9% of the downtime in the data reported in Table 3-1. The
balance of the hardware downtime was due to wiring failures or errors (6%),
necessary shutdowns for fault isolation (6%), and design errors (4%).
Software failures were the largest single cause of downtime. They accounted
directly for 29%, and they required shutdowns for intentional test, etc. that
caused another 6% of the downtime. The largest contributor to the "other"
category for the ESS was personnel errors, which accounted for 24% of downtime.
Unresolved or unclassifiable problems accounted for the balance of the
downtime reported in that category. No outages due to power supply problems
were reported for ESS. This is in sharp contrast with the experience on other
systems.
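
The percentages in the paragraph above can be cross-checked against the
hardware, software, and other shares listed for the No. 4 ESS in Table 3-1.
The sketch below reads the individual contributions as 9%, 6%, 6%, and 4% for
hardware and 29% plus 6% for software; the variable names and the conversion
to minutes per installation per year are illustrative additions.

```python
AVG_DOWNTIME_HRS_PER_YR = 0.75   # Table 3-1, Bell No. 4 ESS

# Percent of downtime by cause, as read from the paragraph above.
hardware = {
    "simultaneous failures of redundant components": 9,
    "wiring failures or errors": 6,
    "shutdowns for fault isolation": 6,
    "design errors": 4,
}
software = {
    "software failures (direct)": 29,
    "shutdowns for intentional test, etc.": 6,
}
personnel_errors = 24

hardware_total = sum(hardware.values())
software_total = sum(software.values())
other_total = 100 - hardware_total - software_total
unresolved = other_total - personnel_errors   # balance of the "other" category


def minutes_per_year(percent: float) -> float:
    """Convert a share of downtime into minutes per installation per year."""
    return AVG_DOWNTIME_HRS_PER_YR * 60 * percent / 100


print(hardware_total, software_total, other_total, unresolved)
print(minutes_per_year(hardware["simultaneous failures of redundant components"]))
```

The category totals match the 25%, 35%, and 40% entries of Table 3-1, which
leaves 16% of downtime for unresolved or unclassifiable problems.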

The  most  significant  facts  emerging  from  the  analysis  of  the  ESS  data  are: 

a.  the unusually high availability of this system

b.  the small contribution to downtime from classical failure mechanisms

c.  the importance of software and personnel failures

The FAA en route air traffic control system utilizes computers that are
derived from the IBM 360 series and incorporate a very effective error
detection and reconfiguration mechanism. Depending on the workload at each
air traffic control center, three or four mainframes are provided of which at
least one is a spare that is activated in case of a hardware failure in one of
the other units. The equipment is representative of computer design in the
early 1960's and was installed between 1967 and 1972 [GRAY80]. Although the
same hardware and basic software are used in each traffic control center,
local modifications and adaptations are authorized to permit each center to
meet its local needs. This, in addition to the varying traffic loads, may
account for differences discussed in the following paragraph. It also needs
to be stated that outage of the computer system does not mean cessation of air
traffic control operations at the affected center. There are further back-up
provisions which impose a higher workload on the controllers but permit safe
handling of controlled aircraft.

The data available on the computer failures of the en route air traffic
control system permit some analysis of the differences between centers. The
following discussion pertains to the number of failures (hardware, software
and unknown, but excluding personnel errors) for the main computer
installation at each center. Number of failures rather than downtime was
selected so as to exclude (as much as possible) differences in maintenance
proficiency, and personnel errors were deleted for the same reason. The
average number of interruptions due to the selected causes during 1980 was
162.7 with a standard deviation of 69.2. The lowest number of interruptions
observed was 15 and the highest number was 357. Fifteen of the twenty centers
(75%) were within one standard deviation of the mean (compared to about 68%
for a theoretical Normal distribution). Busy air traffic control centers were
represented among those with a low interruption frequency as well as among
those experiencing an above average number of interruptions. Because workload
measures were not available, no formal correlation between traffic volume and
failure frequency was undertaken.
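
The comparison with a Normal distribution can be reproduced from the summary
statistics alone. The sketch uses the standard identity that the probability
of falling within k standard deviations of the mean is erf(k / sqrt(2)); the
variable names are illustrative, and the per-center counts themselves are not
available in this report.

```python
import math

mean, std_dev = 162.7, 69.2          # interruptions per center during 1980
centers_total, centers_within = 20, 15

observed_fraction = centers_within / centers_total

# For a Normal distribution, P(|X - mean| < k * sigma) = erf(k / sqrt(2)).
normal_fraction = math.erf(1 / math.sqrt(2))

low, high = mean - std_dev, mean + std_dev    # the "central range"
z_lowest = (15 - mean) / std_dev              # lowest observed count, in standard deviations
z_highest = (357 - mean) / std_dev            # highest observed count, in standard deviations

print(f"observed {observed_fraction:.0%}, Normal {normal_fraction:.1%}")
print(f"central range {low:.1f} to {high:.1f}; extremes at z = {z_lowest:.2f}, {z_highest:.2f}")
```

With only twenty centers, the gap between the observed 75% and the Normal
reference value corresponds to one or two centers, so it should not be read
as strong evidence about the shape of the distribution.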


Much of this difference between centers must be due to controllable causes
(maintenance and administrative practices, nature of the local adaptations,
etc.), and it is interesting to speculate how much benefit might be derived
from an attack on those causes. If the average frequency of interruptions
could have been reduced to the low end of the central range (observed mean
minus one standard deviation), and if downtime is proportional to the number
of interruptions, then the average annual downtime would have been reduced to
less than 20 hours. This reduction is greater than that which could
reasonably be expected from reliability improvement programs in either
hardware or software.
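
The projection can be worked through explicitly. The sketch applies the two
assumptions stated above (interruptions reduced to the mean minus one
standard deviation, and downtime proportional to the number of interruptions)
to the 30 hrs/yr average downtime listed for the FAA system in Table 3-1.
Variable names are illustrative.

```python
mean_interruptions = 162.7
std_dev = 69.2
avg_downtime_hrs = 30.0      # FAA En Route ATC, Table 3-1

target_interruptions = mean_interruptions - std_dev     # low end of the central range
scale = target_interruptions / mean_interruptions       # downtime assumed proportional
projected_downtime_hrs = avg_downtime_hrs * scale
hours_saved = avg_downtime_hrs - projected_downtime_hrs

print(f"target interruptions: {target_interruptions:.1f}")
print(f"projected downtime:   {projected_downtime_hrs:.1f} hrs/yr")
print(f"reduction:            {hours_saved:.1f} hrs/yr")
```

Under these assumptions the projected downtime is a little over 17 hours per
year, consistent with the "less than 20 hours" stated in the text.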

Despite the age of the equipment and the complexity of the computational
tasks, the FAA en route air traffic control computers achieved an availability
of 99.6%. The above analysis suggests that this figure could be further
improved by control of maintenance and administrative practices.

The Federal Reserve Bank operates two computer data systems: the Dataphone 50
system, which is concerned with bulk processing and transfer of computer data
(economic analyses, member bank status reports), and the medium speed system,
which handles individual fund transfer activities. The Dataphone 50 system
was inaugurated in 1975, and it operates at 50 kilobaud over a dedicated coax
cable. It facilitates point-to-point data transfer between all nodes. The
medium speed system has been in operation since 1970. It is laid out as a
star network with the central node at Culpeper, Virginia. Its nominal
transmission rate is 2.4 kilobaud, and it uses a store-and-forward protocol.
A variety of computers are installed at each of the Federal Reserve Banks and
have access to either network.

Two interesting observations were made possible on the basis of the material
furnished on the Federal Reserve Communications System: outages of the
central node contributed only a minor portion of the total downtime, and
workload did not seem to have a significant effect on the duration of the
outages. The Culpeper installation, which serves as the central node for the
medium speed system, had a total downtime of only 13.5 hours during the year,
compared to an outage of 34.9 hours for the average node. Of the 103 failures
at the central node, 87 were caused by software problems. Ten failures were
due to hardware problems, and six of these were reported during one month,
apparently a single problem that was difficult to diagnose. The conclusion
from these observations is that hardware fault tolerance at a central node can
contribute significantly to the reliability of a star network, but that it
needs to be supplemented by software fault tolerance techniques in order to
obtain the full benefit of this link structure.
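
The central node figures can be put in proportion with a few lines of
arithmetic. In this sketch the six failures not attributed to software or
hardware are simply the remainder of the 103 reported; the text does not say
how they were classified, so the label used for them is an assumption.

```python
total_failures = 103
central_failures = {"software": 87, "hardware": 10}
# Remainder not broken out in the text; the label is an assumption.
central_failures["other or unclassified"] = total_failures - sum(central_failures.values())

shares = {cause: count / total_failures for cause, count in central_failures.items()}

central_downtime_hrs = 13.5
average_node_downtime_hrs = 34.9
downtime_ratio = central_downtime_hrs / average_node_downtime_hrs

for cause, share in shares.items():
    print(f"{cause}: {share:.1%}")
print(f"central node downtime relative to average node: {downtime_ratio:.0%}")
```

Software accounts for roughly five of every six failures at the central node,
which is the basis for the conclusion that hardware fault tolerance alone
does not realize the full benefit of the star structure.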

Several investigators have recently reported a strong correlation between
workload and computer failures [BEAU79, CAST81, IYER82]. In the data on the
medium speed system, downtime during the peak hours for that system (1 pm to 4
pm Eastern Standard or Daylight Saving Time) is stated separately. The
average outage per location reported during peak time was 7.05 hrs/yr. Since
the average outage for the entire 12 hour operating period was 34.9 hrs/yr,
this indicates that less than one-quarter of the downtime occurs during that
quarter of the operating day during which the workload is highest. While
downtime and failure frequency are not the same, one expects approximately the
same fraction of each to occur during a given time interval unless special
circumstances prevail. Some explanations for the deviation from the generally
expected relation between workload and failure frequency are:

a.  Maintenance and staffing schedules favor availability during the peak
period. Maintenance actions which might reduce equipment availability
during the peak time are avoided. The most experienced operating and
maintenance personnel are at work during the busy period. Special
procedures are in effect to minimize the probability of a failure during
the peak hours.

b.  The designation of the peak period may be in error. The workload
analysis might have been conducted at some time in the past when a
different pattern prevailed. Users may deliberately schedule most of
their work during 'off-peak' hours, thereby making these de facto peak
hours.

c.  The reported relations between workload and failure frequency may not
apply. Previous studies have been primarily concerned with processing
bound applications whereas the medium speed system is probably channel
bound. Effects that have not yet been identified may cause a deviation
from the expected pattern.

All three factors might be at work, but on the basis of the procedures
followed in similar systems the major contributor to the observed effect is
probably (a). The data on the Federal Reserve Communications System show that
with proper design and procedures the central node in a star network need not
be the weak link, and a disproportionate fraction of the downtime need not
occur during the busiest period.
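
The peak period comparison for the medium speed system reduces to two ratios:
the share of downtime that falls in the peak hours and the share of the
operating day that those hours represent. The sketch below uses the figures
quoted in the text (7.05 and 34.9 hrs/yr, a 3 hour peak within a 12 hour
operating day); the variable names and the derived rate ratio are
illustrative.

```python
peak_downtime = 7.05            # hrs/yr per location during peak hours
total_downtime = 34.9           # hrs/yr per location over the operating day
peak_hours_per_day = 3          # 1 pm to 4 pm
operating_hours_per_day = 12

downtime_share = peak_downtime / total_downtime
time_share = peak_hours_per_day / operating_hours_per_day

# Downtime accrued per scheduled hour, peak relative to off-peak.
# The number of operating days per year cancels out of this ratio.
offpeak_downtime = total_downtime - peak_downtime
offpeak_hours_per_day = operating_hours_per_day - peak_hours_per_day
rate_ratio = (peak_downtime / peak_hours_per_day) / (offpeak_downtime / offpeak_hours_per_day)

print(f"share of downtime in peak hours: {downtime_share:.1%}")
print(f"share of operating day:          {time_share:.1%}")
print(f"peak / off-peak downtime rate:   {rate_ratio:.2f}")
```

A rate ratio below 1 means that an hour of peak operation accumulated less
downtime than an hour of off-peak operation, the opposite of what a simple
workload dependence would predict.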

Data obtained in 1976 on the Stanford Linear Accelerator Computer (SLAC) show
a very pronounced dependence of failure frequency, particularly for failures
due to software, on workload, and this relation is also evident in the current
data. Figure 3-1 illustrates the software failure frequency (total for 1976)
during each one hour period of the day. Note the peak between 11 am and 12
noon, then a decrease during the lunch period, and a secondary peak lasting
from 2 to 4 pm. These are obviously the periods of highest activity on the
system.

Only failures which affected the entire system are included in Figure 3-1,
primarily failures in the operating programs. Because these programs are
particularly active when a new job is started, we normalized the failure
frequency relative to the number of job arrivals during each one hour period.
The resulting graph is shown in Figure 3-2. The peaks during the mid-day
period have been eliminated, and instead there is a pronounced singular peak
between 7 am and 8 am. During this period not many new jobs are started, but
there is a high degree of system activity due to archiving, re-initialization
of the computer, and sometimes phasing in a new release of the operating
system.
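
The normalization described above divides the failure count in each hour by
the number of job arrivals in that hour. The sketch below shows the operation
only; the function name failures_per_job is illustrative, and the hourly
counts are hypothetical values chosen to show how a mid-day peak in raw
counts can turn into an early morning peak after normalization. They are not
the SLAC data.

```python
def failures_per_job(failures_by_hour, arrivals_by_hour):
    """Failure count in each hour divided by job arrivals in that hour."""
    normalized = {}
    for hour, failures in failures_by_hour.items():
        arrivals = arrivals_by_hour.get(hour, 0)
        # Hours with no job arrivals cannot be normalized; mark them as None.
        normalized[hour] = failures / arrivals if arrivals else None
    return normalized


# Hypothetical counts keyed by starting hour (24 hour clock); NOT from the SLAC data.
failures = {7: 6, 11: 12, 14: 10}
arrivals = {7: 20, 11: 400, 14: 350}

rates = failures_per_job(failures, arrivals)
raw_peak_hour = max(failures, key=failures.get)
normalized_peak_hour = max(
    (hour for hour in rates if rates[hour] is not None),
    key=lambda hour: rates[hour],
)
print(raw_peak_hour, normalized_peak_hour)
```

Normalizing by job arrivals treats each arrival as one unit of exposure to
operating program failures; system activity that is not tied to job arrivals,
such as archiving, is not captured by that measure and therefore stands out
after normalization.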

FIGURE 3-1  SLAC COMPUTER OUTAGES DUE TO SOFTWARE FAILURES


The Government fiscal system described in the last column of Table 3-2
consists of a redundant installation of IBM 370/168 computers that service a
nationwide network of approximately 1,000 terminals that direct queries to the
central data base and can also, with safeguards, update that data base. As
evidenced by the high downtime, and the large fraction of that due to hardware
failures, the system appears to be beset by maintenance problems. Over 100
hours of outage due to power and air conditioning failures are included in the
'other causes' classification.

Prime time in this system is defined as a ten hour interval between 8 am and 6
pm Eastern Time. For a subset of the equipment that includes the mainframes,
separate failure statistics were kept for prime time and total time. These
indicate that outages accounted for only 2.2% of the prime time compared to
3.66% of total time. While this again seems to contradict the workload
dependence of failures, it is in this case due to an established policy which
permitted shutdown of one of the redundant computers for maintenance during
non-prime hours. Any failure in the active computer then propagated
immediately to a system outage.
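
The two percentages quoted above imply an outage fraction for the non-prime
hours. The sketch below solves for it under two assumptions that are not
stated in the report: that the ten hour prime time interval applies on every
day (so prime time is 10/24 of total time), and that both percentages refer to
the same subset of equipment. The function name is hypothetical.

```python
def non_prime_outage_fraction(total_fraction, prime_fraction, prime_share):
    """Solve total = prime_share * prime + (1 - prime_share) * non_prime."""
    if not 0.0 < prime_share < 1.0:
        raise ValueError("prime_share must be strictly between 0 and 1")
    return (total_fraction - prime_share * prime_fraction) / (1.0 - prime_share)


# Figures quoted in the text: 2.2% of prime time, 3.66% of total time.
# prime_share = 10/24 is an assumption (prime time observed every day).
non_prime = non_prime_outage_fraction(0.0366, 0.022, 10.0 / 24.0)
print(f"implied non-prime outage fraction: {non_prime:.2%}")
```

Under these assumptions the non-prime outage fraction comes out at roughly
4.7%, about twice the prime time figure, which is consistent with the
explanation that single-computer operation during maintenance periods removed
the protection of redundancy.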


3.3  INTERPRETATION  OF  FINDINGS  FOR  FUTURE  C3I  SYSTEMS 

The most encouraging reliability experience encountered in this survey was
that reported for the No. 4 Electronic Switching System. The most prominent
factors that account for the superior showing appear to be:

a.  A  large,  dedicated  development  staff 

b. Multiple installations of identical equipment

c. Extensive diagnostic programs for failure identification

d.  Building  on  past  experience  with  similar  systems 

The staffing practices at Bell Labs provide specialists in all aspects of
reliability (from device physics through system architecture) within the
project organization. Because of low employee turnover, individuals or small
groups become highly expert at their assigned responsibilities. Factors (b)
through (d) were also present to a large extent in the other systems for which
multiple installations existed, and these probably account for the lower
downtime that was generally reported for these. None of the systems described
in Table 3-1 had a downtime of more than 35 hours per year, whereas all but
the airline systems described in Table 3-2 had downtime considerably in excess
of 35 hours per year.

It is unlikely that the Government can procure C3I systems that have the
legacy of development and operational experience inherent in a Bell
Laboratories electronic switching system. Nonetheless, emphasis on thorough
development and field testing prior to a commitment to operation provides
substantial reliability benefits and should be practiced. The other factors
enumerated above are directly applicable to Government procurements and should
be identified as requirements in future program plans. With proper attention
to such requirements, it seems possible to achieve availability approaching
99.9% in a dual installation.

The analysis presented earlier in this section dwelt heavily on the workload
dependence of computer failures because it is believed that this is
particularly important for military C3I installations. In times of potential
or real conflict, the workload in these systems is expected to increase very
significantly, and it is under these circumstances that failures will have the
worst effect. Thus, predicting an average availability of 99.9% for C3I
equipment can be as misleading as stating that the average depth of a stream
is one foot: a party that tries to ford the stream on that basis may drown
when it finds that the maximum depth is much greater. Availability planning
and prediction must be based on a stated workload that reflects the maximum
a given installation will be exposed to in case of military conflict.
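
The 99.9% figure for a dual installation can be related to the availability
of each single installation with a simple parallel-redundancy model. The
sketch below is an illustration under assumptions that the report does not
make explicit: the two installations fail and are repaired independently,
either one alone can carry the full workload, and switchover is perfect. The
function names are hypothetical.

```python
import math


def dual_availability(single_availability):
    """Availability of two independent units when either one suffices."""
    unavailability = 1.0 - single_availability
    return 1.0 - unavailability ** 2


def required_single_availability(target_dual_availability):
    """Single-unit availability needed to reach a dual-installation target."""
    return 1.0 - math.sqrt(1.0 - target_dual_availability)


HOURS_PER_YEAR = 8760

# Single-unit availability that would yield 99.9% for the pair,
# and the corresponding downtime per unit per year.
needed = required_single_availability(0.999)
downtime_hours = (1.0 - needed) * HOURS_PER_YEAR
```

In this idealized model each installation needs an availability of only about
96.8% (roughly 277 hours of downtime per year) for the pair to reach 99.9%.
The same formula also illustrates the warning above: if single-unit
availability drops under peak workload, or if one unit is deliberately shut
down, the availability of the pair falls well below its long-run average.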

Availability can be increased by furnishing additional spare resources in
place, or by making remote spare resources accessible in case of a failure.
Distributed systems have a high potential for facilitating the latter approach
but here, again, an additional workload dependence needs to be recognized:
utilization of remote computers requires high capacity data links, and these
might be busy or unusable (EMI, etc.) during the time that they are needed to
support geographically dispersed computing. These factors will be evaluated
in later phases of this study.

The star configuration of networks is of particular interest in tactical
military systems because it models the command structure. It was therefore
significant to find that at least one network using that structure did not
experience seriously adverse effects from the dependence of such a network on
the continuous operability of a node.


SECTION  4  —  DESIGN  ISSUES  AND  METHODS  IN  DISTRIBUTED  SYSTEMS 


This section surveys previous work on both the identification of problems of
distributed systems and design methods for their solutions. Subsection 4.1
provides a framework for describing and analyzing the wide variety of network
design methodologies along with the problems and design issues they address.
Subsection 4.2 discusses previous work relevant to C3I applications, and
subsection 4.3 summarizes the results as a set of requirements for the design
of fault tolerant distributed systems.


4.1. A FRAMEWORK FOR DESIGN METHODS AND ISSUES

This framework uses three descriptors to characterize design methodologies for
distributed systems: design motivation, stage of network implementation, and
scope. Design motivation refers to the attribute that is being optimized
(e.g. cost, throughput, etc.). Stage of network implementation relates to the
stage of development of networking components, and ranges from fully developed
systems to network designs where none of the processors, links, terminals, or
software have been developed. Scope is used to describe the range of design
problems addressed, from the formulation of requirements to the final detailed
design.


4.1.1.  Design  Motivations 

Motivations for distributed systems affect the approach to the design, the
evaluation criteria used to make tradeoffs, and figures of merit used to
assess performance. Past work on distributed systems can be classified on the
basis of these motivations, which include increasing throughput or decreasing
response time, lowering communications costs, conforming to the structure of
the user organization, relieving the load on an overburdened system, or
increasing system reliability and availability.


Increasing system throughput or decreasing response times have been major
motivations for the general research community, tactical C3I applications,
real time control, and ballistic missile defense. The principal design issues
have been optimization of task allocation with respect to throughput,
efficient interprocess communication, distributed data bases, and, to a
limited degree, fault tolerance (the ability to add or delete units in a
distributed system provides this flexibility). Because such systems are
generally both non-dispersed and under the control of a single local
commander, distributed computing is not inherently superior to a central
processor in these applications. However, because no hardware appropriate for
field and battle conditions has the requisite throughput, the use of several
smaller units operating in parallel is a viable alternative.

Lowering communications costs has been of primary concern to both DoD and
non-DoD agencies that operate general purpose computing facilities serving
dispersed users. The major design issue is the tradeoff of the cost of local
processing (e.g. for display formatting and local editing on CRTs) versus
that of communication links with the capacity to transmit unreduced data to a
central site for processing. Associated issues are the management and
maintenance of a long-distance communications network, optimal task allocation
with respect to cost (generally a nearly static proposition), minimization of
system response time to user actions, and optimal choice of compatible
hardware and software components. While high reliability is one of the design
goals of these systems, fault tolerance is generally not. One significant
exception, however, is the use of dual processor minicomputers (e.g. Tandem or
Stratus) as front-end processors. Such systems are generally used on
reservation or telephone ordering systems where high reliability is a
requirement for marketing and customer relations.

Conforming to the structure of the user organization is also of concern to all
classes of users. The primary C3I example of this goal acting as the governing
factor in distributed system design is the structure of WWMCCS, in which
computing centers are associated with each of the major functions. Another,
much smaller scale example is the Xerox Ethernet based office automation
network. The major issues are designing such a network in accordance with user
and organizational requirements, providing for the configuration management
and maintenance of a coherent network given the presence of heterogeneous
nodes, and failure diagnosis. Task allocation is not a consideration because
nodes are generally not under the control of a central system supervisor.
While communication costs, reliability, and network throughput must be within
acceptable levels, these concerns are usually not as important as in networks
motivated by the previous two considerations.

Relieving the load on an overburdened central computer is often a motivation
for central computing facilities at major defense, scientific, and commercial
sites. Generally, load relief involves installation of dedicated processors
such as front end communication processors, interactive session processors, or
back end data base machines. Major design issues are compatibility,
throughput, and cost. Associated issues may be system reconfiguration and
fault tolerance, which are enabled by the presence of many interconnected
computers. As was the case with the economically motivated system, task
allocation is considered in the initial design, but will generally not occur
dynamically unless automatic reconfiguration is provided as part of the fault
tolerant implementation.

The final motivation, increasing reliability and availability, is of primary
interest in the present study. The major design considerations include
availability and effectiveness requirements, reconfiguration strategies (on
the node, link, or system level), and acceptable degraded operating modes.
System design requirements associated with throughput, performance, and cost
define constraints for the highly reliable design. Most work on such systems
has been performed in an academic setting, and it has been concerned with
relatively narrow issues (e.g. the number of nodes or link outages that can be
tolerated in a switching network). This work has not considered reliability
improvement as an integral part of a design that is primarily motivated by the
issues previously discussed (increasing throughput, reduced communications
cost, etc.). The integration of reliability enhancing design techniques,
especially in the area of fault tolerance, into a general design methodology
has not been adequately addressed; the second phase of the present
investigation will be aimed at that area.

4.1.2.  State  of  Network  Development 


For the purposes of this section, we consider four stages of system
implementation:

(1) the network is already implemented, and the methodology deals with
its interconnection with other networks

(2) network components (i.e. nodes, links, and software) and architecture
have been developed, and the methodology deals with the optimal
interconnection strategies

(3) computing nodes on the network have been developed and hardware
interfaces are available, and the methodology deals with the
interconnection and control of these computing resources

(4) no network components have been developed, and the methodology deals
with general characteristics of networks.

The first stage is of importance to C3I systems on both the tactical and
strategic levels. Examples include the interconnection of several tactical
air defense C3I systems (e.g. AWACS and ground-based radar) into a single
integrated tactical information center or the linking of radar detection sites
(e.g. PAVE PAWS) with a central command site (e.g. NORAD). Literature on the
design of networks in the first classification (i.e. networks of networks)
centers on the concept of a single "gateway" that serves as an interface
between networks. Because of the early stage of development of this concept,
most of the published literature addresses compatibility issues, standards
(e.g. CCITT X.25), and the problems associated with getting such gateways to
work at acceptable levels. Issues associated with fault tolerance, high
throughput, or cost (beyond the minimum acceptable levels) are seldom treated
in the literature.

The next stage, the design of networks around existing and (more or less)
fully implemented architectures, has been treated in a number of design
methodologies of various scopes (see next subsection). The issues addressed
by these methodologies include the distribution of applications, placement of
nodes, and choice of links (if several types are supported, e.g. telephone and
dedicated lines). Design motivations include economic, performance,
organizational, and reliability considerations. Examples of commercially
available architectures include IBM's SNA and Xerox's implementation of
Ethernet.

The third classification is relevant to tactical C3I systems involving the
interconnection of smaller individual computers and to strategic systems which
may involve different types of processing performed on various machines (e.g.
co-processors used together with an upgrade of the 427-M computers).
Methodologies come in the form of articles and reports documenting the
experience of researchers in constructing these networks. The methodologies
deal with issues such as the design of the communication links, inter-computer
protocols, operating system modifications, and those issues listed in the
previous paragraph. Examples of "home-grown" networks include installations
at Lawrence Livermore Labs and the China Lake Naval Weapons Center. Primary
design motivations are related to increased throughput and relieving the load
on existing mainframes.

The final classification is relevant to those C3I systems which are designed
without the use of any developed computers. The motivations for such systems
are increasing with the growing capabilities of microprocessors coupled with
externally imposed constraints on weight, power consumption, and volume.
Several methodologies have arisen from ballistic missile defense applications.
All issues mentioned in previous paragraphs are relevant, and additional
issues include the structure of the computer hardware, communication links,
and the entire system software.

4.1.3. Scope of the Design Methodologies

An important characteristic of any design methodology is how much of the
problem it covers. Nagle et al. [NAGL79] point out that design
methodologies for fault tolerant distributed systems must begin at the system
definition, or requirements, level and proceed through to implementation.
Different methodologies designate various steps in the design of
computer-based systems; Figure 4-1, taken from Sloane and Wrobleski [SLOA82],
shows those used at TRW. Many design methodologies in the literature do not
address applications which are sufficiently specific that all the steps in
Figure 4-1 are appropriate. Others have been developed for problems posed by
the distributed system itself, not by any application induced requirements.
Thus, for the purposes of this study, three general scope descriptors will be
used:


Requirements - The methodology addresses the formulation and development of
system requirements from functional or mission requirements stated in
non-computer-like terms.

Architecture - The methodology addresses specific issues in distributed
systems design including protocols, task allocation, distributed
operating systems, data bases, etc.

Communications - The methodology addresses issues related to choice of
communication media, problems in their implementation, and monitoring of
the network links.


4.2.  PREVIOUS  WORK 

This section reviews some of the recent work on the identification of design
issues and methodologies in distributed systems. Table 4-1 presents a
grouping of the methodologies and a summary description based on the framework
described in Subsection 4.1.

TABLE 4-1 DESIGN METHODOLOGIES AND RELATED TOPICS REVIEWED IN THIS SECTION

REFERENCE             APPLICATION  MOTIVATION        DEVELOPMENT     SCOPE

Alford [ALFO81]       BMD          Throughput        No components   Reqts.
                                   Reliability       developed       Arch.

Meier, Lemoine        BMD          Throughput        No components   Arch.
and Nam [MEIE81]                   Reliability       developed

FitzGerald & Eason    Business     Economic          Developed       Sys. Arch.
[FITZ78]                           Response time                     Reqts.
                                   Reliability

Frankel [FRAN82]      Business     Economic          Developed       Sys. Arch.
                                                                     Commu.

DiCiccio, et al.      Unspec.      Organizational    Developed       Sys. Arch.
[DICI79]                           (Inter-network)                   Reqts.

Popek [POPE81]        Unspec.      Reliability       No components   Reqts.
                                                     developed

Gien & Zimmerman      Unspec.      Performance       Developed       Network
[GIEN79]                                                             Reqts.

4.2.1. Design of Distributed Systems in BMD Applications

Current design concepts for ballistic missile defense (BMD) systems
emphasize local networks of computers. The primary design motivations are
increased throughput, decreased response time, and improved availability.
Such systems are generally not designed around any existing computers or
networks, and thus the entire range of distributed system design issues must
be considered. Software issues include communications protocols, design of
the distributed operating system (i.e. replicated and nonreplicated modules,
interprocess communication, etc.), design and implementation of a distributed
data base system, and task allocation.

Alford et al. [ALFO81] have devised a distributed computing design system
that is based on a methodology with eight top-level steps and a large number
of lower level tasks and sub-tasks. The eight overall steps are system
requirements definition, data processing (primarily in the area of operating
systems, not communications) subsystem engineering, process design (i.e.
defining and placing processing nodes and allocating computing tasks),
sequential program design, code and unit test, integration and test, and
operation and maintenance. The design methodology is unique in clearly
providing for the definition of critical functions, network reconfiguration,
and alternate paths in the requirements phase and propagating these issues in
the subsequent design steps. Its main contribution, however, is that it
formalizes and structures the design process to the extent that many of the
details, information interfaces, and error checking can be computerized.

Van Tilborg and Jasinski [VANT81] deal with design issues in operating
systems. The three major areas in the design of BMD distributed operating
systems are interprocess communication (both within a node and between nodes),
database management, and task allocation (during design, normal operation, and
reconfiguration). The design objectives of interprocess communication
protocols are (1) minimizing demands on processor throughput, (2) detecting,
preventing, or avoiding deadlock, (3) reducing the amount of handshaking
needed to synchronize the data exchange, and (4) ensuring that transmitted
data is received undamaged. Issues in the design and operation of database
systems include (1) where to put data bases with respect to the processors
which will access them, (2) the extent to which the data should be replicated
in order to reduce access times, and (3) how to minimize access times subject
to database consistency and integrity requirements. Issues in task allocation
include both distributing the tasks to the various nodes and scheduling them
according to precedence and timing constraints.

Meier, Lemoine, and Nam [MEIE81] concentrate specifically on the issue of
dynamic task allocation in an advanced Low Altitude BMD system. There are
known task scheduling algorithms which can solve problems such as minimizing
response time for a set of tasks subject to timing and precedence constraints.
However, few of them are tractable for large systems. These authors evaluate
computationally feasible (though not necessarily optimal) algorithms for
effectiveness against specific threat scenarios by means of simulations. This
approach can be quite useful in the development of reconfiguration and
re-allocation schemes for fault tolerant systems.
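
One familiar example of a computationally feasible, though not necessarily
optimal, approach is greedy list scheduling. The sketch below is a generic
textbook-style heuristic offered only as an illustration; it is not one of the
algorithms evaluated in [MEIE81], and the task names, durations, and node
count are hypothetical.

```python
def list_schedule(durations, predecessors, node_count):
    """Greedy list scheduling of tasks with precedence constraints.

    durations:    dict task -> execution time
    predecessors: dict task -> set of tasks that must finish first
    node_count:   number of identical processing nodes
    Returns dict task -> (node, start_time, finish_time).
    """
    node_free_at = [0.0] * node_count
    schedule = {}
    remaining = set(durations)
    while remaining:
        # A task is ready once all of its predecessors have been scheduled.
        ready = [t for t in remaining
                 if all(p in schedule for p in predecessors.get(t, ()))]
        if not ready:
            raise ValueError("precedence constraints cannot be satisfied")
        # Heuristic priority: longest task first, task name breaks ties.
        task = min(ready, key=lambda t: (-durations[t], t))
        earliest = max((schedule[p][2] for p in predecessors.get(task, ())),
                       default=0.0)
        # Choose the node on which the task can start soonest.
        node = min(range(node_count),
                   key=lambda n: max(node_free_at[n], earliest))
        start = max(node_free_at[node], earliest)
        finish = start + durations[task]
        schedule[task] = (node, start, finish)
        node_free_at[node] = finish
        remaining.remove(task)
    return schedule


# Hypothetical four-task example on two nodes.
example_durations = {"track": 3, "discriminate": 2, "assign": 1, "report": 1}
example_predecessors = {
    "discriminate": {"track"},
    "assign": {"discriminate"},
    "report": {"track"},
}
example_schedule = list_schedule(example_durations, example_predecessors, 2)
```

Re-running such a heuristic with a reduced node_count is one simple way to
explore re-allocation after a node failure, although a realistic study would
also need the timing constraints and threat scenarios discussed above.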

4.2.2.  Business  Systems 

Many large business oriented computer networks are quite similar to strategic
C3I systems on all but the application specific level. Both environments use
dispersed mainframe computer installations connected by a communications
network, use similar communications hardware and software, and have similar
reliability requirements. Thus, although requirements on the application
level may differ somewhat, literature on the design and implementation of
these systems is of relevance to this study.

The primary design motivation of distributed systems in business applications
is the provision of an acceptable level of service at minimum cost and
development time. As a result, such systems generally rely on commercially
available networking systems which integrate hardware and software into a
ready-made architecture that can be tailored to the requirements of the user.
Examples of such products include IBM's Systems Network Architecture (SNA),
initially designed for automated tellers, and Xerox's Ethernet, intended for
office automation applications. Formulation of user requirements is the focus
of most design methodologies in this area (of which there are a number);
issues that may appear minor and subtle to the network designer can be major
contributors to the success or failure of the network (e.g. placement of CRTs,
consideration of power, space, and temperature requirements, etc.).

Requirements formulation for distributed systems in business applications must
focus on three issues: (1) user requirements of network performance, (2)
traffic that the network must bear, and (3) cost and time constraints.
Options available to the system designer include CPUs, front end communication
processors and PABXs, modems, tandem switching centers, multiplexers,
concentrators, message switches, and common carrier services.

FitzGerald and Eason [FITZ78] define a ten-step procedure which can be grouped
into three phases: pre-requirements, requirements, and implementation. The
pre-requirements phase involves problem definition (in user terms), approach
development, background information gathering on the organization, examination
of the "people problems" and other associated issues affected by the
distributed system, and generation of functional requirements (in user terms).
The second phase consists of formulating system requirements and constraints,
generating design alternatives that meet the requirements subject to the
constraints, and choice of the best system. The implementation phase involves
convincing management of the needs for the system, purchase and installation
of the system, acceptance testing, development of operating and maintenance
procedures, performance monitoring, and fine tuning.

This procedure differs from the previous BMD application in the following
ways:

1. Because of the unwritten "organizational culture" with which the
analyst may not be familiar, a large proportion of the requirements
phase must be devoted to understanding not only explicit and
quantifiable requirements but also implicit criteria which will
affect the acceptability of the design.

2. The reluctance of most organizations to invest in distributed systems
research and development necessitates the use of commercially
available components with service and support from the vendor. Thus,
most of the work in the development of design alternatives involves
examination of the performance specifications and any credible
reliability data of system components, not the design of new
devices.

3. The non-technical nature of the user organization requires special
attention to "human factors" engineering in both the hardware and
software, relations with the decision-making entities (i.e.
management), and training beyond that required to operate specific
software packages or systems.

System design issues are quite similar in other aspects. Certain load factors
can be predicted (e.g. the transmission of administrative and financial
information at predetermined intervals), while others cannot (e.g. the
interactive entry of customer orders). Concerns on the validity of
transmitted information are often central for applications such as automated
bank teller terminals just as they are for C3I applications. System
availability and reliability for applications such as airline reservations are
crucial to the economic well being (i.e. survival) of organizations just as
they are in defense settings.

Frankel [FRAN82] concentrates on one aspect of system design (topology of
the communications network) and on one criterion (cost). Figure 4-2A shows
six nodes connected to a center. In this example, the nodes are simple CRT
terminals and the center is a minicomputer, but the considerations can be
extended to any star network. Figure 4-2B shows the same functional
configuration interconnected as a single multidrop line. The motivation for
the multidrop configuration is cost: network A has a monthly cost twice that
of network B at 1982 rates. However, other motivations may favor network A.
For example, if link bandwidths are a constraint (as opposed to the processing
capacity at the central node), or if the reliability of the links is low
compared with that of other components, then A is preferable. On the other
hand, if the multidrop link is a higher capacity line or consists of redundant
paths, then such considerations would favor network B over A, although not at
the same cost advantage.
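
The bandwidth and reliability side of this tradeoff can be made concrete with
a small comparison. The sketch below is a deliberately simplified model and
not taken from [FRAN82]: it assumes that link failures are independent, that
a failure of the multidrop line disconnects every terminal, and that the
multidrop bandwidth is shared equally. The function names, the availability
value, and the line speed are hypothetical.

```python
def star_network(terminals, link_availability, link_bandwidth):
    """One dedicated link per terminal (the Figure 4-2A arrangement)."""
    return {
        "links": terminals,
        "terminals_lost_per_link_failure": 1,
        "bandwidth_per_terminal": link_bandwidth,
        "prob_all_terminals_cut_off": (1.0 - link_availability) ** terminals,
    }


def multidrop_network(terminals, line_availability, line_bandwidth):
    """One shared line serving every terminal (the Figure 4-2B arrangement)."""
    return {
        "links": 1,
        "terminals_lost_per_link_failure": terminals,
        "bandwidth_per_terminal": line_bandwidth / terminals,
        "prob_all_terminals_cut_off": 1.0 - line_availability,
    }


# Six terminals as in Figure 4-2; 0.99 and 9600 bit/s are placeholder values.
star = star_network(6, 0.99, 9600)
multidrop = multidrop_network(6, 0.99, 9600)
```

With equal link quality the star gives each terminal the full link bandwidth
and confines a link failure to one terminal, while the multidrop line divides
its bandwidth and exposes all terminals to a single failure. Raising the
capacity or availability passed to multidrop_network corresponds to the higher
capacity or redundant line mentioned above.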

4.2.3. Networking Considerations

This subsection discusses design approaches and methodologies in terms of the
network rather than in terms of an application. Major problems in this area
include the interconnection of heterogeneous networks and computer systems as
well as general software issues such as distributed operating systems or data
bases.

Gien and Zimmerman [GIEN79] concentrate on the problems of network
interconnection, and provide solutions in the form of analogies to
heterogeneous computer networks, in which special interfaces must be provided.
Figure 4-3 is a pictorial representation of the problem: given the fact that
networks A and B are geographically dispersed and computationally
incompatible, how do users X and Y communicate?

The primary design motivation is to ensure transparency to the end user.


In general terms, the method proposed by the two authors involves transferring
the message from the user node to the network gateway (which may be a single
processor or two "gateway halves", one located at each network), routing it
through the gateway to the second network, and then passing it through the
second network to the appropriate destination node. Such a strategy involves
addressing (i.e. a local address for the gateway on network A, a global
address designation on the gateway for network B, and a local address
designation for node Y on network B), routing (through network A to the
gateway, from the gateway to network B, and through network B to the
destination), and the matching of incompatible protocols for error detection,
flow control, and terminal control (by means of definition of a third protocol
with interfaces to those of networks A and B).

FIGURE 4-2 ECONOMICALLY MOTIVATED NETWORK DESIGN (FROM [FRAN82])

FIGURE 4-3 INTERCONNECTION OF HETEROGENEOUS NETWORKS (FROM [GIEN82])

Interconnecting different networks through the installation of additional
hardware and software interfaces and the implementation of additional layers
in the communications protocol may minimize the impact on existing systems
but can cause reductions in throughput and reliability. For example, the
presence of only single gateways between the networks poses significant
reliability problems. However, if multiple gateways and network entry points
are used, additional scheduling, addressing, and contention resolution issues
have to be addressed. The additional complexity caused by hierarchical
addressing and routing schemes can also result in reliability problems in
both data integrity and correct execution of the protocols. Finally, the use
of an intermediate protocol across the gateway further decreases throughput
and reliability.

Alternate approaches are available. For example, the more closely the
internal networks resemble each other, the less complex the interface. CCITT
standard X.75 dictates some degree of internal network commonality [DICI79],
and greater similarities can further reduce intercommunication problems.
DiCiccio et al. [DICI79] also discuss advantages of using packets rather
than virtual circuits as a means of network interconnection for detecting a
failure in the message cascade. Although their approach can lead to increases
in throughput and error detection capability, it still has drawbacks from
the reliability, and especially from the fault tolerance, point of view.

Popek [POPE80] describes the reliability problems associated with a
distributed data base. Partitioning is a means of preventing error
propagation and is an important means of reducing the time necessary for
restart and recovery. Redundancy is the means by which error detection occurs
as well as a necessary part of any recovery process. Because distributed
systems lend themselves to both partitioning and redundancy, they have
considerable potential for highly reliable and available operation.

One of the major problems in distributed data bases is ensuring the integrity
of the data in the event of a system crash. Three general techniques are
available for this purpose: atomic transactions, two phase commit, and the
transaction log. Atomic transactions are bracketed by "Begin Transaction" and
"End Transaction" designations. In the event of a failure, it is the system's
responsibility to ensure that all partially completed sequences of
instructions are removed and all completed transactions are stored in the
system's permanent memory. The two phase commit procedure involves a
supervisor, a data sender, and a data receiver (all of which might be
procedures resident on a single host). The supervisor queries both the
sender and receiver on their status, and when both are ready, it commands
the sender to transfer the data to the receiver. At the completion of the
transfer, the supervisor commands the receiver to commit the transaction, and
the receiver returns a commit acknowledge signal. If the system crashes
before the commit acknowledge, the system retains the previous value upon
recovery. Both the atomic transaction and the two phase commit procedure
require a transaction log in which intermediate values are stored and can be
recalled in the event of a failure.
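
The two phase commit sequence described above can be illustrated with a
minimal sketch. The class and function names (Sender, Receiver, supervise)
and the simulated crash flag are assumptions made for illustration only; they
do not come from the referenced work, and a single in-memory attribute stands
in for the transaction log.

```python
class Receiver:
    """Holds one committed value plus a log entry for a pending transfer."""

    def __init__(self, value):
        self.committed = value  # value that survives a crash
        self.log = None         # stand-in transaction log (staged value)

    def ready(self):
        return True

    def receive(self, value):
        self.log = value        # staged only, not yet committed

    def commit(self):
        self.committed = self.log
        self.log = None
        return "commit-ack"

    def recover(self):
        self.log = None         # discard staged data, keep previous value


class Sender:
    def __init__(self, value):
        self.value = value

    def ready(self):
        return True

    def transfer(self, receiver):
        receiver.receive(self.value)


def supervise(sender, receiver, crash_before_commit=False):
    """Run one transfer; return True only if the commit was acknowledged."""
    if not (sender.ready() and receiver.ready()):  # status query
        return False
    sender.transfer(receiver)                      # data transfer
    if crash_before_commit:                        # simulated system crash
        receiver.recover()
        return False
    return receiver.commit() == "commit-ack"       # commit and acknowledge
```

With a receiver holding "old" and a sender offering "new", a run with
crash_before_commit=True leaves the receiver at "old", while a normal run
leaves it at "new".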

While such constructs are not unique to distributed data bases, their
implementation over a slow and noisy network poses throughput and reliability
problems. For example, the requirement of four messages in order to write an
item to a non-resident data base may be unacceptable in many C3I applications.
Thus, alternative techniques, examples of which are contained in the reference
[POPE80], are necessary.


4.3  KEY  ISSUES  IN  THE  DESIGN  OF  FAULT  TOLERANT  DISTRIBUTED  SYSTEMS 

From the analysis of the design methodologies discussed above, certain key
issues can be identified which will govern the design of fault tolerant C3I
systems. Such systems must have the ability to:

1. Detect and identify failures on nodes and links.

2. Re-establish contact to nodes in the event of a link failure by
either (a) using an alternate link along the same path, or (b)
establishing an alternate path.

3. Restore critical computer functions by either (a) reconfiguring the
node to restore full capabilities on a local level or (b)
re-allocating and scheduling tasks among other nodes.

4. Retain all critical data (and access to it).

5. Detect and recover from (or prevent) deadlock in the contention for
resources, execution of tasks, or accessing of data.

6. Restore (or prevent) errors in data transmission and storage.

7. Access other critical networks in the event of a failure of the
primary gateway.

SECTION 5 - FAULT LOCATION TECHNIQUES


The ability to locate (identify) faults is a key requirement in the
implementation of fault tolerance. Most work in fault location has been
carried out at the logic level [BREU76], and only a few authors have addressed
fault location in networks of digital processors. Where the latter approach
has been taken, as in [BLOU77], there has been emphasis on general
applicability of the techniques rather than on specific implementations. To
supplement that work, detailed fault location techniques for connected
processors are described here on the basis of examples for specific
configurations. All of the examples utilize a combination of pre-processors
and mainframes with segmentation (switching provisions) between the
pre-processors and the mainframes. The pre-processors may be signal
processors, communication concentrators, or the gateway through which
interactive processes are connected to the mainframe. The techniques
described here are still applicable, with obvious simplifications, where no
pre-processors are involved.

Three examples are treated here, all of them representative of configurations
that were encountered in the study of existing fault tolerant or linked
computer systems. Common assumptions and notation are discussed first. The
subsequent headings in this section then describe fault location for:

Single user segmented dual computer systems

Single user segmented dual computers with shared memory

Multiple user segmented dual computer systems

Fault location is defined as a process that is initiated after an error has
been detected and after output devices that might be adversely affected by
diagnostic procedures have been disconnected from the computers. Faults are
assumed to be solid at the system level. This includes cases in which an
internal transient fault has placed the computer into a state in which no
further processing in accordance with requirements is possible.


5.1  ASSUMPTIONS  AND  NOTATION 

A typical system of this type is shown in Figure 5-1, and the capital letter
symbols used there are referred to in the following text.

5.1.1  Assumptions 

FIGURE 5-1 SINGLE USER SEGMENTED SYSTEM

(1) Test initiation and evaluation. In order to locate the source of the
fault, there must be at least one accessible reliable component that can then
test other components adjacent to it and thereby create a directory of
functioning components. The following sequence of operations will be followed
in each test. First, the user selects (randomly, if necessary) a reliable
component and initiates a prestored self-test routine. If this fails, another
component is selected. The first reliable component identified by this
process then stimulates another unit under test (UUT) to execute a predefined
diagnostic routine, and it expects to receive the results generated by this
routine. If no results are returned within a specified time, the UUT is
marked as malfunctioning. If results are returned, the reliable component
compares them with a stored benchmark and accepts the UUT as operative only if
all results agree.
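
The test sequence of assumption (1) can be sketched as follows. This is an
illustration under stated assumptions, not the report's procedure: the
function names are hypothetical, self_test and stimulate are caller-supplied
stand-ins for the prestored routines, and a time-out is modelled simply as
stimulate returning None.

```python
def find_reliable(components, self_test):
    """Return the first component whose prestored self-test passes, else None."""
    for component in components:
        if self_test(component):
            return component
    return None


def evaluate_uut(stimulate, uut, benchmark):
    """Tester-side evaluation of one unit under test (UUT).

    stimulate(uut) is assumed to return the diagnostic results, or None
    when no results arrive within the specified time.
    """
    results = stimulate(uut)
    if results is None:
        return False             # time-out: mark the UUT as malfunctioning
    return results == benchmark  # operative only if all results agree
```

A tester found with find_reliable would call evaluate_uut once for each
adjacent unit and record the outcome in its directory of functioning
components.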

(2) User involvement. As a baseline for the fault location procedures, a
substantial amount of user involvement has been assumed. While the sequence
of units to be tested is identified in the procedures presented below, the
actual issuance of commands to implement the sequence is assumed to be
performed by the user. In principle it is possible to store the sequence and
issue it as a single command. However, the conditions encountered in the
early part of the test affect the actions to be taken in later ones.
Recognition of these conditions, which may involve the interpretation of
outputs generated by a malfunctioning computer, is in general best handled by
a trained human observer, possibly with the aid of some computer functions.
The performance of fully automated diagnostics for an unrestricted fault set
on arbitrary computer architectures is a specialized research area outside the
scope of the effort reported on here. Likewise, in the baseline approach, the
user is expected to select an appropriate repair or reconfiguration action
after the fault condition has been identified by the procedures described
here. Certain sequences in the procedures are arbitrary, e.g., whether to
start the test with processor C or D in Figure 5-1. In order to generate a
repeatable procedure, processor C was selected as the first UUT. A
knowledgeable user may decide on the basis of past history or immediate
observations that D is more likely to be at fault and therefore start the test
there. Such deviations are permitted, but they are not an essential part of
the user involvement in the test procedures.

(3) Perfect test coverage. Generally, the time and storage cost for a test
is proportional to the thoroughness of the testing of a hardware component.
It was assumed that sufficient resources can be allocated for a test that
gives a very high assurance that it will not pass a malfunctioning component
(nearly 100% test coverage). The failure of software used for testing was not
allowed for. To the extent that actual diagnostics do not yield 100% test
coverage, a malfunctioning component might be declared operable and a failure
will then occur in a later test step or during operation. User involvement
can alter the sequence of testing so that another operable combination of
components can be configured.

(4) Two-way transmission. It was assumed that links can carry test-related
messages in both directions. The bandwidth required for this purpose is small
because the stimulus is usually expressed as a single command, and the result
can be compressed.

(5) Distinction between processors and links. It is often difficult to draw
a clear distinction between failures in a link and in a processor connected to
that link. If a malfunction disables processor P's communication with all its
neighbours Q1 ... Qn, then the failure is attributed to the processor
although it might be a common failure in all links. On the other hand, if the
failure leaves at least one communication path between P and Qi operable, then
the failure is attributed to the affected links although it might be a failure
in the processor that affects a portion of the communications capabilities.
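
The attribution rule of assumption (5) can be written down compactly. The
sketch below is illustrative only; the function name and the dictionary
representation of link status are assumptions, not part of the procedures in
this report.

```python
def attribute_failure(link_ok):
    """Attribute a malfunction at processor P to the processor or to links.

    link_ok maps each neighbour Qi of P to True if the path P-Qi is
    operable and to False otherwise.
    """
    failed = [q for q, ok in link_ok.items() if not ok]
    if not failed:
        return ("none", [])
    if len(failed) == len(link_ok):
        # No neighbour can be reached: attributed to the processor,
        # although a common failure of all links cannot be excluded.
        return ("processor", [])
    # At least one path still works: attributed to the affected links.
    return ("links", failed)
```

For example, a processor that can reach Q1 but not Q2 yields a link
attribution for Q2, while a processor that can reach neither neighbour is
itself marked as failed.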

(6) Irrecoverable faults. The purpose of locating faulty processors is to
remove them from the net and to resume real-time operations. However, there
must be at least one normally functioning path between the input (S in Figure
5-1) and the user (U). Faults which do not leave such a path are not worth
locating because the system cannot be automatically restored to useful
service. The fault location procedures therefore stop as soon as an
irrecoverable fault has been identified.

(7) Preprocessors with shared memory (applicable to 5.3 only). A test of a
preprocessor involves use of shared memory and therefore tests the shared
memory. It is assumed that the shared memory has error correcting code that
masks transient and isolated permanent memory faults. Therefore only solid
failures affecting substantial areas of the memory will affect preprocessor
operation. Memory is regarded as functioning if at least one preprocessor
passes tests involving shared memory. Links to the shared memory are treated
as part of the preprocessors served by them since a preprocessor without
access to shared memory is not suitable for normal operation.

5.1.2  Notation  and  Types  of  Tests 

Three types of tests are used in the fault location procedures:

Type 1 - Direct user test

Type 2 - Test with only forward information flow

Type 3 - Test with reverse information flow

Examples of the notation used and of the application of these tests are given
below.

Type  1  -  notation  U  -t->  C 

This denotes a test in which the user (U) stimulates computer C and receives
results from it. This test is applicable only to processors directly
accessible by the user, such as C or D in Figure 5-1.

Type  2  -  notation  C  -t->  A 

This type of test is used for establishing operability of the preprocessors
and associated links. It is assumed that the selection of the tester (C) and
of the UUT (A) is made by the user, and that the user has visibility of the
outcome (at least pass/fail) of the test.

Type  3  -  notation  Ad  -t->  C 

This means that A, while being stimulated through D, tests C (backward flow of
information). The lower case letter is used in lieu of a subscript. If U
-t-> C has failed while U -t-> D and D -t-> A have succeeded, it is not clear
whether C has failed or whether the link from the user to C is inoperable. By
sending the test initiation order through D for A to test C, an independent
means is found for determining whether C is operable.

Two symbols connected by a dash represent the link between the elements
designated by the symbols, e.g., U-D stands for the link between the user and
computer D, and A-C stands for the link between computers A and C.


5.2  SINGLE  USER  SEGMENTED  DUAL  COMPUTER  SYSTEMS 

The fault location procedure presented below applies to the single user
segmented system without shared memory. The structure connected by the broken
line in Figure 5-1 is not present for this case.

The procedure consists of first testing the processors connected to the user
interface, C and D (and, by implication, the links to the user). After these
have been found to be operational, the processors at the source interface, A
and B, and their backward links (to C and D) are tested. The links from A and
B to the source are considered to be part of the latter and are not explicitly
validated in this procedure. If the preprocessors check out on the test
described here and yet no usable information is received in the operational
mode, failure of the source links or of the source is implied.

The normal, forward directed (upward in Figure 5-1), part of the test is
flowcharted in Figures 5-2 through 5-4. Where failures were encountered in
tests initiated directly by the user (of the form U -t-> X), backward tracing
is used in later phases of the test to determine whether the failure affects
an entire computer or only the user link or interface. Certain other links
are also diagnosed separately from the processors which they serve by means of
backward directed tests. These diagnostics are shown in Figures 5-5 through
5-8. Not all computer installations that use the structure of Figure 5-1 may
have the capability to perform backward directed tests. This capability is
not essential for a determination of the operational status (i.e., which
processors are accessible and working properly), but where it is not provided
many link failures cannot be distinguished from processor failures.

A summary of the diagnostic information obtained at each step of the fault
location procedure is shown in the lower part of each figure. The case number
is the number sequence shown in the rectangular boxes after each test. The
designation 'tested' used in the summary means that the components or links
were identified as operational in the sequence performed up to this point. In
some instances a computer can be identified as operational in the test
sequence although it cannot be accessed in normal operation due to failures
in other processors or links. These computers are designated as 'non-usable'
in the diagnostic summaries (single asterisk). In other cases, the
diagnostics cannot distinguish between a processor failure and simultaneous
failures in all links to a processor; this situation is identified by a double
asterisk in the summaries. From the operational point of view, it makes no
difference whether the processor or all links have failed.

SUMMARY OF DIAGNOSTICS

Case           Failed Components           Tested          No. of
                                           Components      Tested Links

1              None                        C               1
1.1            None                        C, D            2
1.1.1          None                        A, C, D         3
1.1.2          A or A-D                    C, D            2
1.2            D or U-D                    C               1
1.2.1          D or U-D                    A, C            2
1.2.2          D or U-D, A or A-C          C               1
2              C or U-C                    none            0
2.1            C or U-C                    D               1
2.1.1          C or U-C                    B, D            2
2.1.2          C or U-C, B or B-D          D               1
2.2            C or U-C, D or U-D          none            0

FIGURE 5-2 FAULT LOCATION PROCEDURE OF SECTION 5.2, PART 1

SUMMARY OF DIAGNOSTICS

Case           Failed Components           Tested          No. of
                                           Components      Tested Links

1.1.1          None                        A, C, D         3
1.1.1.1        None                        A, B, C, D      4
1.1.1.1.1      None                        A, B, C, D      5
1.1.1.1.1.1    None                        A, B, C, D      6
1.1.1.1.1.2    B-C                         A, B, C, D      5
1.1.1.1.2      A-C                         A, B, C, D      4
1.1.1.1.2.1    A-C                         A, B, C, D      5
1.1.1.1.2.2    A-C, B-C                    A, B, C*, D     4
1.1.1.2        B or B-D                    A, C, D         3
1.1.1.2.1      B or B-D                    A, C, D         4
1.1.1.2.1.1    B-D                         A, B, C, D      5
1.1.1.2.1.2    B**                         A, C, D         4
1.1.1.2.2      B or B-D, A-C               A, C, D         3
1.1.1.2.2.1    B-D, A-C                    A, B, C, D      4
1.1.1.2.2.2    B**, A-C                    A, C*, D        3

*  non-usable     ** processor or all connections have failed

FIGURE 5-3 FAULT LOCATION PROCEDURE OF SECTION 5.2, PART 2

SUMMARY OF DIAGNOSTICS

Case           Failed Components           Tested              No. of
                                           Components          Tested Links

1.1.2          A or A-D                    C, D                2
1.1.2.1        A or A-D                    B, C, D             3
1.1.2.1.1      A-D                         A, B, C, D          4
1.1.2.1.1.1    A-D                         A, B, C, D          5
1.1.2.1.1.2    A-D, B-C                    A, B, C, D          4
1.1.2.1.2      A**                         B, C, D             3
1.1.2.1.2.1    A**                         B, C, D             4
1.1.2.1.2.2    A**, B-C                    B, C*, D            3
1.1.2.2        A or A-D, B or B-D          C, D*               2
1.1.2.2.1      A-D, B or B-D               A, C, D*            3
1.1.2.2.1.1    A-D, B-D                    A, B, C, D*         4
1.1.2.2.1.2    A-D, B**                    A, C, D*            3
1.1.2.2.2      A**, B or B-D               C, D*               2
1.1.2.2.2.1    A**, B-D                    B, C, D*            3
1.1.2.2.2.2    A**, B**                    C, D* (not oper.)   2

*  non-usable     ** processor or all connections have failed

FIGURE 5-4 FAULT LOCATION PROCEDURE OF SECTION 5.2, PART 3


SUMMARY OF DIAGNOSTICS

Case           Failed Components           Tested          No. of
                                           Components      Tested Links

1.2.1          D or U-D                    A, C            2
1.2.1.1        D or U-D                    A, B, C         3
1.2.1.1.1      U-D                         A, B, C, D*     4
1.2.1.1.1.1    U-D                         A, B, C, D*     5
1.2.1.1.1.2    U-D, A-D                    A, B, C, D*     4
1.2.1.1.2      D or (U-D & B-D)            A, B, C         3
1.2.1.1.2.1    U-D, B-D                    A, B, C, D*     4
1.2.1.1.2.2    D**                         A, B, C         3
1.2.1.2        D or U-D, B or B-C          A, C            2
1.2.1.2.1      U-D, B or B-C               A, C, D*        3
1.2.1.2.1.1    U-D, B-C                    A, B*, C, D*    4
1.2.1.2.1.2    U-D, B**                    A, C, D*        3
1.2.1.2.2      B**, D**                    A, C            2

*  non-usable     ** processor or all connections have failed

FIGURE 5-5 FAULT LOCATION PROCEDURE OF SECTION 5.2, PART 4


SUMMARY OF DIAGNOSTICS

Case           Failed Components                   Tested            No. of
                                                   Components        Tested Links

1.2.2          D or U-D, A or A-C                  C                 1
1.2.2.1        D or U-D, A or A-C                  B, C              2
1.2.2.1.1      U-D, A or A-C                       B, C, D*          3
1.2.2.1.1.1    U-D, A-C                            A*, B, C, D*      4
1.2.2.1.1.2    U-D, A**                            B, C, D*          3
1.2.2.1.2      D or (U-D & B-D), A or A-C          B, C              2
1.2.2.2        D or U-D, A or A-C, B or B-C        C* (non-oper.)    1

*  non-usable     ** processor or all connections have failed

FIGURE 5-6 FAULT LOCATION PROCEDURE OF SECTION 5.2, PART 5


SUMMARY OF DIAGNOSTICS

Case           Failed Components           Tested          No. of
                                           Components      Tested Links

2.1.1          C or U-C                    B, D            2
2.1.1.1        C or U-C                    A, B, D         3
2.1.1.1.1      U-C                         A, B, C*, D     4
2.1.1.1.1.1    U-C                         A, B, C*, D     5
2.1.1.1.1.2    U-C, B-C                    A, B, C*, D     4
2.1.1.1.2      C or (U-C & A-C)            A, B, D         3
2.1.1.1.2.1    U-C, A-C                    A, B, C*, D     4
2.1.1.1.2.2    C**                         A, B, D         3
2.1.1.2        C or U-C, A or A-D          B, D            2
2.1.1.2.1      U-C, A or A-D               B, C*, D        3
2.1.1.2.1.1    U-C, A-D                    A*, B, C*, D    4
2.1.1.2.1.2    U-C, A**                    B, C*, D        3
2.1.1.2.2      A**, C**                    B, D            2

*  non-usable     ** processor or all connections have failed

FIGURE 5-7 FAULT LOCATION PROCEDURE OF SECTION 5.2, PART 6

SUMMARY OF DIAGNOSTICS

Case           Failed Components                   Tested            No. of
                                                   Components        Tested Links

2.1.2          C or U-C, B or B-D                  D                 1
2.1.2.1        C or U-C, B or B-D                  A, D              2
2.1.2.1.1      U-C, B or B-D                       A, C*, D          3
2.1.2.1.1.1    U-C, B-D                            A, B*, C*, D      4
2.1.2.1.1.2    U-C, B**                            A, C*, D          3
2.1.2.1.2      C or (U-C & A-C), B or B-D          A, D              2
2.1.2.2        C or U-C, A or A-D, B or B-D        D* (non-oper.)    1

*  non-usable     ** processor or all connections have failed

FIGURE 5-8 FAULT LOCATION PROCEDURE OF SECTION 5.2, PART 7

The detailed procedure for performing these tests is presented in a design
language based on Pascal in the Appendix. Where case designations are used in
the Appendix, they correspond to those shown in the flowcharts; however, not
all case designations shown on the flowcharts are mentioned in the design
language version of the test procedure.


5.3  SINGLE  USER  DUAL  COMPUTERS  WITH  SHARED  MEMORY 

The fault location procedure presented under 5.2 above is also effective for
the case where the two processors have shared memory (the broken line in
Figure 5-1 represents the connection). The analysis presented below
interprets the outcomes of the procedures of 5.2 for the case of shared
memory.

It is assumed that any failure in the shared memory will result in failure of
tests for both Processor A and Processor B. Therefore, if either Processor A
or Processor B is found to be operational it may be assumed that the shared
memory is functioning correctly. Conversely, when both processors are found
to be inoperative there is a high probability that the shared memory has
failed, although this case cannot be distinguished by the gross diagnostics
used here from a simultaneous failure of A and B or from a failure of all
backward connections (lines going down in Figure 5-1). The detailed test data
will usually permit differentiation between processor and memory failures.

The diagnostics furnished by the tests shown in Figures 5-2 through 5-8 are
analyzed in Table 5-1. If a test case results in a definitive finding
regarding the shared memory (either usable or not usable), this finding will
also be valid for all subsidiary test cases, and they are not separately
listed. Thus, the finding that the shared memory is usable for 1.1.1 implies
that it is also usable for all cases 1.1.1.x.x.x where x may represent either
a 1 or a 2. The figure numbers shown in the table are valid until a new
figure number is shown.

A finding of 'not usable' for the shared memory can arise either from
inability to access either one of the processors using the memory, or from the
observation that both are inoperative. In the former case, the status of the
memory is really unknown, and this is indicated by a single x in the table.
In the latter case, it is highly likely that the shared memory has failed,
although, as indicated above, other possibilities cannot be completely ruled
out, and this case is designated by xx in the table.
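
The interpretation rules for the shared memory can be summarized in a short
sketch. The function name, the three-valued status encoding, and the final
fallback are assumptions made for illustration: True stands for a processor
found operational, False for one found inoperative, and None for one that
could not be accessed or has not yet been tested.

```python
from typing import Optional


def shared_memory_finding(a: Optional[bool], b: Optional[bool]) -> str:
    """Interpret the status of preprocessors A and B for the shared memory."""
    if a is True or b is True:
        return "usable"           # one working preprocessor suffices
    if a is False and b is False:
        return "not usable (xx)"  # memory failure highly likely
    if a is None and b is None:
        return "not usable (x)"   # neither accessible: status unknown
    # Assumed fallback: one preprocessor failed, the other not yet tested.
    return "further diagnostics required"
```

The fallback branch reflects the assumption that a non-definitive case is
resolved only by continuing with its subsidiary test cases.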

TABLE 5-1 SHARED MEMORY DIAGNOSTICS

Case            Shared Memory             Further         Ref.
                Usable    Not Usable      Diagnostics     Figure No.
                                          Required

1.1.1           X                                         5-2
1.1.2                                     X
1.2.1           X
1.2.2                                     X
2.1.1           X
2.1.2                                     X
2.2                       X
1.1.2.1         X                                         5-4
1.1.2.2                                   X
1.1.2.2.1       X
1.1.2.2.2                                 X
1.1.2.2.2.1     X
1.1.2.2.2.2               XX
1.2.2.1         X                                         5-6
1.2.2.2                   XX
2.1.2.1         X                                         5-8
2.1.2.2                   XX


5.4 MULTIPLE USER SEGMENTED COMPUTER SYSTEMS

A typical configuration of this type is shown in Figure 5-9. It will be
recognized that this figure is identical with Figure 5-1 except for the
connections at the user and source ends. To capitalize on the similarity, it
is convenient to divide the fault location procedure into three phases that
establish the operability of (1) the user interface, (2) the computer network
proper and (3) the source interface. Phase 1 and Phase 3 procedures are
developed in detail below. Phase 2 procedures represent an adaptation of
those described in Section 5.2.

At the start of Phase 1, each user must determine the accessibility of
computers C and D by a sign-on procedure. When this is completed, there will
be an access log within C and D which will be of the form (U1)(U2)(U3) where
each term will have a value of 1 if the corresponding user Ui has logged in
and a value of 0 otherwise. Thus, if the access log for computer C is 101,
this means that users U1 and U3 can access this computer and user U2 cannot.
Any 0 value represents a diagnostic for an inoperative user link. In
addition, the ensemble of the access logs determines the procedure to be
followed in Phase 2. For that purpose, the outcomes of Phase 1 can be
classified in the following manner:

1a. One or more users can access both C and D

1b. C and D can both be accessed, but not by the same user(s)

1c1. C can be accessed by one or more users, D cannot be accessed

1c2. D can be accessed by one or more users, C cannot be accessed

1d. Neither C nor D can be accessed by any user.

The classification of Phase 1 outcomes is derived from the access log codes
generated within the C and D computers as shown in Table 5-2. The "1"
prefixes have been omitted in the table.

TABLE 5-2 CLASSIFICATION OF PHASE 1 OUTCOMES

Access                     Access Log D
Log C     111   110   101   011   100   010   001   000

111       a     a     a     a     a     a     a     c1
110       a     a     a     a     a     a     b     c1
101       a     a     a     a     a     b     a     c1
011       a     a     a     a     b     a     a     c1
100       a     a     a     b     a     b     b     c1
010       a     a     b     a     b     a     b     c1
001       a     b     a     a     b     b     a     c1
000       c2    c2    c2    c2    c2    c2    c2    d
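
The entries of Table 5-2 follow directly from the two access logs, as the
sketch below illustrates. The function name is hypothetical, and the access
logs are assumed to be given as bit strings such as "101" with one position
per user.

```python
def classify_phase1(log_c, log_d):
    """Classify a Phase 1 outcome from the access logs of computers C and D."""
    c = int(log_c, 2)
    d = int(log_d, 2)
    if c == 0 and d == 0:
        return "1d"   # neither computer can be accessed by any user
    if d == 0:
        return "1c1"  # only C can be accessed
    if c == 0:
        return "1c2"  # only D can be accessed
    if c & d:
        return "1a"   # at least one user can access both C and D
    return "1b"       # both accessible, but not by the same user
```

For instance, access logs 101 for C and 010 for D give classification 1b,
since no single user reaches both computers.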


If Phase 1 produces an outcome in the 1a. classification, Phase 2 can be
initiated by any user who can access both computers, and the procedure of
Section 5.2 can be applied without modification.  If Phase 1 produces an
outcome in the 1b. classification, separate actions by two users will be
necessary during the Phase 2 procedure.  The user who can access C (but not D)
proceeds in accordance with case 1.2 in Figure 5-2, and the user who can
access D (but not C) proceeds in accordance with case 2.1 in Figure 5-2.  If
the Phase 1 procedure results in a 1c1. classification, only the case 1.2
procedure can be initiated, and if it results in a 1c2. classification only
the case 2.1 procedure can be initiated.  Where Phase 1 terminates with a 1d.
classification the system is not recoverable, and no Phase 2 activity can be
conducted.
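
The selection of Phase 2 actions described above amounts to a small lookup.
The following Python sketch is illustrative only; the dictionary and function
names are hypothetical, and the case numbers are those of Figure 5-2 as cited
in the paragraph.

```python
# Phase 2 actions keyed by the Phase 1 classification.
PHASE2_ACTIONS = {
    "1a": ["Section 5.2 procedure, run by a user who can access both C and D"],
    "1b": ["case 1.2, run by the user who can access C",
           "case 2.1, run by the user who can access D"],
    "1c1": ["case 1.2"],
    "1c2": ["case 2.1"],
    "1d": [],  # system not recoverable: no Phase 2 activity
}


def phase2_actions(phase1_class: str) -> list:
    """Return the list of Phase 2 actions for a Phase 1 classification."""
    return PHASE2_ACTIONS[phase1_class]
```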

A similar classification of Phase 2 outcomes is utilized to determine the
Phase 3 procedure.  These classifications are based on the operability of the
A and B computers (these are also sometimes referred to as preprocessors) in
accordance with Table 5-3.

TABLE 5-3  CLASSIFICATION OF PHASE 2 OUTCOMES

Computer                 Computer B
   A              Operable    Not Operable

Operable             2a           2b1
Not Operable         2b2          2c


With both preprocessors operative (case 2a), the fault location technique can
distinguish between a source failure and a failure of a single link associated
with a sensor.  When only one of the preprocessors is operative (cases 2b1 and
2b2), this distinction cannot be made.  When both preprocessors are
inoperative, no diagnostics of the source subsystem are possible.

Testing of the sources and links in Phase 3 involves observation by the A and
B computers of predefined characteristics of the input data stream, such as
frequency of bit value transitions, frequency of start of cycle characters and
absence of alarm characters.  The observations at each processor are
designated in the following as (S1)(S2), where a value of 1 for S1 designates
an operable condition (predefined characteristics are present), and a value of
0 designates an inoperable condition.  Thus, if the observation at computer A
has a value of 10, this means that source S1 appears operable and source S2
appears inoperable.  The classification of the combined observations from
computers A and B (for a Phase 2 outcome of 2a) during Phase 3 is shown in
Table 5-4.


TABLE 5-4  CLASSIFICATION OF PHASE 3 OUTCOMES

Computer            Computer B
   A          11     10     01     00

   11         3a     3b     3b     3c
   10         3b     3d     3bb    3e
   01         3b     3bb    3d     3e
   00         3c     3e     3e     3f

These classifications have the following meaning:

3a.  Both sources fully usable

3b.  One source fully usable; one source usable on one link only

3bb.  Both sources usable on one link only

3c.  Both sources accessible from only one preprocessor

3e.  Only one source accessible from one preprocessor

3f.  No sources accessible

For a Phase 2 outcome of 2b1, only the last column in Table 5-4 is applicable,
and for a Phase 2 outcome of 2b2, only the last row in Table 5-4 is
applicable.  A Phase 2 outcome of 2c is indistinguishable from a Phase 3
outcome of 3f.
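
The Phase 3 classification can be sketched as a table lookup in Python.  The
sketch below copies Table 5-4 and, as an assumption made for illustration,
treats an inoperable preprocessor as one that observes "00", which selects the
last column or last row of the table.  All names are hypothetical.

```python
# Table 5-4: rows are computer A's observation, columns computer B's.
OBSERVATIONS = ["11", "10", "01", "00"]
TABLE_5_4 = {
    "11": ["3a", "3b", "3b", "3c"],
    "10": ["3b", "3d", "3bb", "3e"],
    "01": ["3b", "3bb", "3d", "3e"],
    "00": ["3c", "3e", "3e", "3f"],
}


def classify_phase3(phase2_class: str, obs_a: str = "00", obs_b: str = "00") -> str:
    """Return the Phase 3 classification for the given observations.

    Assumption: a preprocessor that is not operable contributes the
    observation "00", so 2b1 uses the last column, 2b2 the last row,
    and 2c always yields 3f.
    """
    if phase2_class in ("2b1", "2c"):
        obs_b = "00"   # B is not operable
    if phase2_class in ("2b2", "2c"):
        obs_a = "00"   # A is not operable
    return TABLE_5_4[obs_a][OBSERVATIONS.index(obs_b)]
```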
APPENDIX


DESIGN  OF  A  FAULT  LOCATION  PROGRAM 


A computer program for the fault location procedure described in Section 5 has
been designed.  The specific fault location procedures are listed here in a
Pascal-like design language.  [[ and ]] are used as shorthand notations for
"Begin" and "End", respectively.  < Routine name > indicates transfer to a
routine that is listed later.


A.1  SINGLE USER SEGMENTED DUAL COMPUTER SYSTEMS

Reference Figure 5-1.  The shared memory is not present in this case.

procedure locate;
  case (U -t-> C) of
    pass: "case 1: C & (C-U) are ok"
      case (U -t-> D) of
        pass: "case 1.1: D & (D-U) are ok"
          case (D -t-> A) of
            pass: "case 1.1.1: A & (A-D) are ok"
              <locate-1.1.1>;
            fail: "case 1.1.2: A or (A-D) is malfunctioning"
              <locate-1.1.2>
          end"case 1.1";
        fail: "case 1.2: D or (D-U) is malfunctioning"
          case (C -t-> A) of
            pass: "case 1.2.1: A & (A-C) are ok"
              <locate-1.2.1>;
            fail: "case 1.2.2: A or (A-C) is malfunctioning"
              <locate-1.2.2>
          end"case 1.2"
      end"case 1";
    fail: "case 2: C or (C-U) is malfunctioning"
      case (U -t-> D) of
        pass: "case 2.1: D & (D-U) are ok"
          -- This case is similar to case 1.2 except that --
          -- C is interchanged with D and A with B --
        fail:
          -- The system is not recoverable --
      end"case 2"
end"locate"
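
The top-level decision tree of locate can be sketched in Python as follows.
This is an illustration, not the designed program: test(src, dst) is a
hypothetical callable that returns True for "pass", and make_tester is a
hypothetical fault model in which a test passes only when both end points and
the link between them are fault free.

```python
def locate_case(test):
    """Return the case label selected by the top-level tests of `locate`."""
    if test("U", "C"):                       # case 1: C & (C-U) are ok
        if test("U", "D"):                   # case 1.1: D & (D-U) are ok
            return "1.1.1" if test("D", "A") else "1.1.2"
        # case 1.2: D or (D-U) is malfunctioning, so A is tested from C
        return "1.2.1" if test("C", "A") else "1.2.2"
    # case 2: C or (C-U) is malfunctioning
    return "2.1" if test("U", "D") else "2.2"


def make_tester(faulty):
    """Build a test function from a set of faulty elements, e.g. {"D", "A-C"}.

    Assumption: link names are the two end points in alphabetical order.
    """
    def test(src, dst):
        link = "-".join(sorted((src, dst)))
        return not ({src, dst, link} & faulty)
    return test


no_fault = locate_case(make_tester(set()))
d_failed = locate_case(make_tester({"D"}))
```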

procedure locate-1.1.1;
  case (D -t-> B) of
    pass: "case 1.1.1.1: B & (B-D) are ok"
      [[ if (C -t-> A) then mark '(A-C) is ok'
                       else mark '(A-C) is malfunctioning';
         if (C -t-> B) then mark '(B-C) is ok'
                       else mark '(B-C) is malfunctioning' ]];
    fail: "case 1.1.1.2: B or (B-D) is malfunctioning"
      [[ if (C -t-> A) then mark '(A-C) is ok'
                       else mark '(A-C) is malfunctioning';
         if (C -t-> B) then mark 'B & (B-C) are ok and
                                  (B-D) is malfunctioning'
                       else mark 'B is malfunctioning or
                                  (B-C) & (B-D) are malfunctioning' ]]
end"locate-1.1.1".
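
For illustration, locate-1.1.1 can be written as a Python function that
returns the list of marks.  The callable test(src, dst), returning True for
"pass", is a hypothetical stand-in for the -t-> operation.

```python
def locate_1_1_1(test):
    """Return the marks produced by procedure locate-1.1.1."""
    marks = []
    b_ok_from_d = test("D", "B")
    # The (A-C) link is checked the same way in both branches.
    marks.append("(A-C) is ok" if test("C", "A") else "(A-C) is malfunctioning")
    if b_ok_from_d:
        # case 1.1.1.1: B & (B-D) are ok, only (B-C) remains to be checked
        marks.append("(B-C) is ok" if test("C", "B") else "(B-C) is malfunctioning")
    elif test("C", "B"):
        # case 1.1.1.2, B reachable from C: the fault is in (B-D)
        marks.append("B & (B-C) are ok and (B-D) is malfunctioning")
    else:
        marks.append("B is malfunctioning or (B-C) & (B-D) are malfunctioning")
    return marks


# Example: D cannot test B, but C can test both A and B.
results = {("D", "B"): False, ("C", "A"): True, ("C", "B"): True}
marks = locate_1_1_1(lambda src, dst: results[(src, dst)])
```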

procedure locate-1.1.2;
  case (D -t-> B) of
    pass: "case 1.1.2.1: B & (B-D) are ok"
      [[ if (C -t-> A) then mark 'A & (A-C) are ok and
                                  (A-D) is malfunctioning'
                       else mark 'A is malfunctioning or
                                  (A-C) & (A-D) are malfunctioning';
         if (C -t-> B) then mark '(B-C) is ok'
                       else mark '(B-C) is malfunctioning' ]];
    fail: "case 1.1.2.2: B or (B-D) is malfunctioning"
      [[ if (C -t-> A) then mark 'A & (A-C) are ok and
                                  (A-D) is malfunctioning'
                       else mark 'A is malfunctioning or
                                  (A-C) & (A-D) are malfunctioning';
         if (C -t-> B) then mark 'B & (B-C) are ok and
                                  (B-D) is malfunctioning'
                       else mark 'B is malfunctioning or
                                  (B-C) & (B-D) are malfunctioning' ]]
end"locate-1.1.2".

procedure locate-1.2.1;
  case (C -t-> B) of
    pass: "case 1.2.1.1: B & (B-C) are ok"
      case (Bc -t-> D) of
        pass: "case 1.2.1.1.1: D & (B-D) are ok and
               (D-U) is malfunctioning"
          if (Ac -t-> D) then mark '(A-D) is ok'
                         else mark '(A-D) is malfunctioning';
        fail: "case 1.2.1.1.2: D is malfunctioning or
               (B-D) & (D-U) are malfunctioning"
          if (Ac -t-> D) then mark 'D & (A-D) are ok and
                                    (B-D) & (D-U) are malfunctioning'
                         else mark 'D or (A-D) is malfunctioning'
      end"case 1.2.1.1";
    fail: "case 1.2.1.2: B or (B-C) is malfunctioning and
           both are unusable and (B-D) is also unusable"
      case (Ac -t-> D) of
        pass: "case 1.2.1.2.1: D & (A-D) are ok and
               (D-U) is malfunctioning"
          if (Dac -t-> B) then mark 'B & (B-D) are ok and
                                     (B-C) is malfunctioning.
                                     D, (A-D), B, and (B-D) are unusable'
                          else mark 'B is malfunctioning or
                                     (B-C) & (B-D) are malfunctioning';
        fail: "case 1.2.1.2.2"
          mark 'D is malfunctioning or (A-D) & (D-U) are malfunctioning.
                B, D, (B-C), (A-D) and (B-D) are unusable.'
      end"case 1.2.1.2"
end"locate-1.2.1"

procedure locate-1.2.2;
  case (C -t-> B) of
    pass: "case 1.2.2.1: B & (B-C) are ok"
      case (Bc -t-> D) of
        pass: "case 1.2.2.1.1: D & (B-D) are ok and
               (D-U) is malfunctioning"
          if (Dbc -t-> A) then mark 'A & (A-D) are ok and
                                     (A-C) is malfunctioning.
                                     (A-D), A, and (A-C) are unusable'
                          else mark 'A is malfunctioning or
                                     (A-C) & (A-D) are malfunctioning';
        fail: "case 1.2.2.1.2"
          mark 'D is malfunctioning or (B-D) & (D-U) are malfunctioning'
      end"case 1.2.2.1";
    fail: "case 1.2.2.2: B or (B-C) is malfunctioning"
      -- The system is not recoverable --
end"locate-1.2.2"

A.2  SINGLE USER DUAL COMPUTERS WITH SHARED MEMORY

The fault location procedure presented under A.1 above is also effective for
the case where the two processors have shared memory (the broken line
connection in Figure 5-1 is present).  The notes and procedures presented
below interpret the outcomes of the procedures of A.1 for the case of shared
memory.

(1)  "case 1.1.1":  Preprocessor A passed a test and thus it is reasonable to
conclude that shared memory M is ok.

(2)  "case 1.1.2.1":  M is ok.

(3)  "case 1.1.2.2":

     [[ if (C -t-> A) then mark 'A & (A-C) are ok and
                                 (A-D) is malfunctioning and
                                 M is ok'
                      else mark 'A is malfunctioning or
                                 (A-C) & (A-D) are malfunctioning';
        if (C -t-> B) then mark 'B & (B-C) are ok and
                                 (B-D) is malfunctioning and
                                 M is ok'
                      else [[ mark 'B is malfunctioning or
                                    (B-C) & (B-D) are malfunctioning' ]];
        if M has not been validated then mark 'M may be malfunctioning' ]]

(4)  "case 1.2.1":  M is ok.

(5)  "case 1.2.2.1":  M is ok.

(6)  "case 1.2.2.2":  M may be malfunctioning.

(7)  "case 2.1":  This case is the same as case 1.2 except for exchanging C
with D and A with B.

(8)  "case 2.2":  "(U -t-> C) = (U -t-> D) = fail":  M's status is unknown.


A.3  MULTIPLE USER SEGMENTED COMPUTER SYSTEMS

Reference Figure 5-9.  The fault location procedure consists of three phases
that are described in the following.  Additional notations introduced are:

MS - The set of elements identified as malfunctioning
WS - The set of elements validated as working
procedure phase1;  "Identification of usable main processors"

  case ((U1 -t-> C),(U2 -t-> C),(U3 -t-> C)) of
    (pass,pass,pass): "WS = [C,(C-U1),(C-U2),(C-U3)]"
        mark 'C, (C-U1), (C-U2), & (C-U3) are ok';
    (pass,pass,fail): "WS = [C,(C-U1),(C-U2)]; MS = [(C-U3)]"
        mark 'C, (C-U1), & (C-U2) are ok and
              (C-U3) is malfunctioning';
    (pass,fail,pass): "WS = [C,(C-U1),(C-U3)]; MS = [(C-U2)]"
        mark 'C, (C-U1), & (C-U3) are ok and
              (C-U2) is malfunctioning';
    (pass,fail,fail): "WS = [C,(C-U1)]; MS = [(C-U2),(C-U3)]"
        mark 'C & (C-U1) are ok and
              (C-U2) & (C-U3) are malfunctioning';
    (fail,pass,pass): "WS = [C,(C-U2),(C-U3)]; MS = [(C-U1)]"
        mark 'C, (C-U2), & (C-U3) are ok and
              (C-U1) is malfunctioning';
    (fail,pass,fail): "WS = [C,(C-U2)]; MS = [(C-U1),(C-U3)]"
        mark 'C & (C-U2) are ok and
              (C-U1) & (C-U3) are malfunctioning';
    (fail,fail,pass): "WS = [C,(C-U3)]; MS = [(C-U1),(C-U2)]"
        mark 'C & (C-U3) are ok and
              (C-U1) & (C-U2) are malfunctioning';
    (fail,fail,fail): "MS = [C or [(C-U1),(C-U2),(C-U3)]]"
        mark 'C or [(C-U1),(C-U2),(C-U3)] is malfunctioning'
  end"case";

  case ((U1 -t-> D),(U2 -t-> D), --
    -- same as above except that
       (1) C is replaced by D,
       (2) WS = [--] is replaced by WSnew = WSold + [--], and
       (3) MS = [--] is replaced by MSnew = MSold + [--] --
  end"case"

end"phase1".
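
The eight branches of the case statement in phase1 follow one pattern, which
the Python sketch below makes explicit.  The function name and the dictionary
of test results are hypothetical; True stands for "pass" and False for "fail".

```python
def phase1_for_processor(proc, results):
    """Build the working set (WS) and malfunctioning set (MS) for one main processor.

    `results` maps a user name ("U1", ...) to True ("pass") or False ("fail").
    """
    ws, ms = [], []
    if not any(results.values()):
        # No user reached the processor: the processor itself or all of
        # its user links are suspect, and nothing can be validated.
        links = ",".join(f"({proc}-{user})" for user in results)
        ms.append(f"{proc} or [{links}]")
        return ws, ms
    ws.append(proc)  # at least one user reached it, so the processor works
    for user, passed in results.items():
        (ws if passed else ms).append(f"({proc}-{user})")
    return ws, ms


# The (pass, fail, pass) branch for C, then the sets are extended for D.
ws, ms = phase1_for_processor("C", {"U1": True, "U2": False, "U3": True})
ws_d, ms_d = phase1_for_processor("D", {"U1": False, "U2": False, "U3": False})
ws_new, ms_new = ws + ws_d, ms + ms_d
```

The last line mirrors the WSnew = WSold + [--] and MSnew = MSold + [--]
updates used when the tests are repeated for D.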

If neither C nor D can be used by any user, then the system is
irrecoverable and the fault location procedure stops.

The actions taken during Phase 2 are chosen on the basis of the results
of Phase 1.  The possible results of Phase 1 can be classified into the
following cases:

case 1.a:  Both C and D can be used by the same user.
           e.g., both (U1 -t-> C) and (U1 -t-> D) resulted in "pass".

case 1.b:  There is no user who can use both C and D, but C can
           be used by one user Ui and D can be used by another user Uj.
           e.g., ((U1 -t-> C),(U1 -t-> D)) resulted in "(pass,fail)"
           while ((U2 -t-> C),(U2 -t-> D)) resulted in "(fail,pass)".

case 1.c:  Only one main processor, C or D, can be used by any user.
           This can be divided into two subcases.
           case 1.c.1:  C is usable but D cannot be used by any user.
           case 1.c.2:  D is usable but C cannot be used by any user.

case 1.d:  None of the main processors are usable.

In the last case (case 1.d), there are no Phase 2 actions.  In the other
cases, the actions taken in Phase 2 are the same as some parts of the fault
location procedure described in Section 5.2.  The details of the Phase 2
actions are as follows:

procedure phase2;  "Diagnosis of processors"
  case results-of-phase1 of

    1.a:   "This corresponds to case 1.1 in the fault location procedure in
            Section 5.2.  Using the same procedure, the operability of
            processors and their interconnections can be obtained.";

    1.b:   "This also corresponds to case 1.1 in the fault location procedure
            in Section 5.2 except that whenever C needs to communicate with a
            user, e.g., in the case of C -t-> A, Ui is involved, whereas
            whenever D needs to communicate with a user, Uj is involved.";

    1.c.1: "This corresponds to case 1.2 in the fault location procedure in
            Section 5.2.  Using the same procedure, the operability of
            processors and their interconnections can be obtained.";

    1.c.2: "This corresponds to case 2.1 in the fault location procedure in
            Section 5.2.  Using the same procedure, the operability of
            processors and their interconnections can be obtained.";

    1.d:   "The system is irrecoverable." stop

  end"case"

end"phase2".

The actions taken during Phase 3 depend on the results of Phase 2.  These are
classified into the following cases.

case 2.a:  Both preprocessors are usable.

case 2.b:  Only one preprocessor is usable.

           case 2.b.1:  A is usable but B cannot be used by any user.

           case 2.b.2:  B is usable but A cannot be used by any user.

case 2.c:  None of the preprocessors are usable.

Here it is assumed that a preprocessor can tell the operability of a source by
watching whether readable information comes from the source.  Therefore,
A -t-> S1 means that A observes the information coming from source S1 and then
makes a report on the status of S1.

The details of Phase 3 are as follows:

procedure phase3;  "Diagnosis of sources"

  case results-of-phase2 of

    2.a:  "A & B are usable"
      [[ case ((A -t-> S1),(B -t-> S1)) of
           (pass,pass): mark 'S1, (S1-A), & (S1-B) are ok';
           (pass,fail): mark 'S1 & (S1-A) are ok and
                              (S1-B) is malfunctioning';
           (fail,pass): mark 'S1 & (S1-B) are ok and
                              (S1-A) is malfunctioning';
           (fail,fail): mark 'S1 or [(S1-A),(S1-B)] is malfunctioning'
         end"case";
         case ((A -t-> S2), --
           -- same as above except that S1 is replaced by S2 --
         end"case" ]];

    2.b.1:  "A is usable"
      [[ case (A -t-> S1) of
           pass: mark 'S1 & (S1-A) are ok';
           fail: mark 'S1 or (S1-A) is malfunctioning'
         end"case";
         case (A -t-> S2) of
           pass: mark 'S2 & (S2-A) are ok';
           fail: mark 'S2 or (S2-A) is malfunctioning'
         end"case" ]];

    2.b.2:  "B is usable"
      [[ -- same as in case 2.b.1 except that A is replaced by B -- ]];

    2.c:  "The system is irrecoverable." stop

  end"case"
end"phase3".
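
The per-source logic of phase3 can be sketched in Python as a single function
covering cases 2.a and 2.b.  The function names are hypothetical, and
observations holds one True ("pass") or False ("fail") entry per usable
preprocessor.

```python
def _join(items):
    """Join names in the style of the marks: 'X & Y' or 'X, Y, & Z'."""
    if len(items) == 2:
        return " & ".join(items)
    return ", ".join(items[:-1]) + ", & " + items[-1]


def diagnose_source(source, observations):
    """Return the mark for one source.

    `observations` maps each usable preprocessor name to True or False;
    it has two entries in case 2.a and one entry in case 2.b.
    """
    if not observations:
        raise ValueError("no usable preprocessor (case 2.c)")
    good = [f"({source}-{p})" for p, ok in observations.items() if ok]
    bad = [f"({source}-{p})" for p, ok in observations.items() if not ok]
    if not good:
        # Nothing was seen: the source or every link to it is suspect.
        suspects = bad[0] if len(bad) == 1 else "[" + ",".join(bad) + "]"
        return f"{source} or {suspects} is malfunctioning"
    mark = _join([source] + good) + " are ok"
    if bad:
        verb = "is" if len(bad) == 1 else "are"
        mark += " and " + " & ".join(bad) + f" {verb} malfunctioning"
    return mark


both_ok = diagnose_source("S1", {"A": True, "B": True})
one_link = diagnose_source("S1", {"A": True, "B": False})
```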


A.4  EXTENSIONS OF THE TECHNIQUES

A.4.1  Fault Location in a Reduced Configuration

After a malfunctioning component has been located, the component is
functionally removed from the system and the rest of the system continues to
operate.  The removal of the component is recorded in the system status
table.  If another fault is detected later, a slightly modified version of
the fault location procedure described earlier is followed.

The modification is that for each component the test is preceded by an
examination of the system status table to determine whether the component in
question has been functionally removed.  The component test will follow only
if the component has not been removed, and then the previously described
procedures will be used.
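
A minimal Python sketch of the status table check is given below.  It assumes
that the system status table can be represented as a set of removed element
names and that a test is skipped when the tested component or the link to it
has been removed; both points, and all names, are assumptions made for
illustration.

```python
def guarded_test(status_table, test, src, dst):
    """Run a component test only if the component has not been removed.

    `status_table` is a set of functionally removed elements, e.g. {"B"};
    `test(src, dst)` returns True for "pass".  Returns None when skipped.
    """
    link = "-".join(sorted((src, dst)))
    if dst in status_table or link in status_table:
        return None              # component already removed: no test
    return test(src, dst)


removed = {"B"}                  # B was removed after an earlier fault
calls = []

def fake_test(src, dst):
    """Stand-in test that records each call and always passes."""
    calls.append((src, dst))
    return True

skipped = guarded_test(removed, fake_test, "D", "B")
ran = guarded_test(removed, fake_test, "D", "A")
```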


A.4.2  Diagnostic Information Contained in Reports from Fault Detectors

It is possible to skip certain steps in the fault location procedure
described in Section 5.4 by exploiting the information contained in the reports
made by fault detectors.  There are two types of components capable of
detecting faults: preprocessor and main processor.  A preprocessor is capable
of telling whether a source is functioning or dead.  On the other hand, a main
processor may be capable of telling if a preprocessor is dead or not.
Moreover, if the two preprocessors have been assigned to process the same
data, then a main processor should be able to detect a mismatch between the
outputs of the two preprocessors.  In all these cases, a preprocessor which
detected a fault should send a report to all the users.

case 1:  Preprocessor Y reported "Source Si is dead".
         There are six possible paths from a preprocessor to the users.

  case 1.1:  The other preprocessor Y' made the same report.
             conclusion:  Source Si is indeed dead.

  case 1.2:  Preprocessor Y' did not make the report.
             conclusion:  Link (Si-Y), preprocessor Y', or all the paths
                          from Y' to users are malfunctioning.

  case 1.3:  The report from Y did not come through all six paths.
             conclusion:  There are malfunctioning components on those
                          paths which the report did not come through.

case 2:  Main processor Z reported "Preprocessor Y is dead".
         There are three links from a main processor to the users.

  case 2.1:  The other main processor Z' made the same report.
             conclusion:  Preprocessor Y is indeed dead.

  case 2.2:  Main processor Z' did not make the report.
             conclusion:  Link (Y-Z), main processor Z', or all the links
                          from Z' to users are malfunctioning.

  case 2.3:  The report from Z did not come through all three links.
             conclusion:  The links which the report did not come through
                          are malfunctioning.

case 3:  Main processor Z reported "The outputs of the two
         preprocessors disagree".

  case 3.1:  The other main processor Z' made the same report.
             conclusion:  At least one of the preprocessors is malfunctioning.

  case 3.2:  Main processor Z' did not make the report.
             conclusion:  Main processor Z' or all the links from Z' to
                          users are malfunctioning.

  case 3.3:  The report from Z did not come through all three links.
             conclusion:  The links which the report did not come through
                          are malfunctioning.
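
The interpretation of a report can be sketched in Python for case 1.  The
sketch assumes that case 1.3 may occur together with case 1.1 or case 1.2,
which the text does not state explicitly; the function name, its parameters,
and the path names in the example are hypothetical.

```python
def interpret_dead_source_report(source, y, y_other, other_reported, missing_paths):
    """Return conclusions for the report "Source <source> is dead" made by Y.

    `other_reported` tells whether the other preprocessor made the same
    report; `missing_paths` lists the paths (out of six) on which the
    report from Y did not arrive.
    """
    conclusions = []
    if other_reported:
        # case 1.1
        conclusions.append(f"Source {source} is indeed dead")
    else:
        # case 1.2
        conclusions.append(
            f"Link ({source}-{y}), preprocessor {y_other}, or all the paths "
            f"from {y_other} to users are malfunctioning"
        )
    if missing_paths:
        # case 1.3
        conclusions.append(
            "There are malfunctioning components on paths: " + ", ".join(missing_paths)
        )
    return conclusions


confirmed = interpret_dead_source_report("S1", "A", "B", True, [])
unconfirmed = interpret_dead_source_report("S1", "A", "B", False, ["A-D-U2"])
```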

The knowledge obtained as above can be used to shorten to a certain
extent the fault location procedure to follow.  However, it will increase the
complexity of the overall fault location procedure and will not change the
worst-case execution time of the location procedure.  Therefore, the decision
on whether to exploit the information contained in the fault report or not
should be made with a consideration of the operational mode of the network
(e.g., dual redundant operation, concurrent processing of different data,
etc.), time constraints, logical complexity constraints, etc.

The fault location procedures described earlier do not distinguish a
malfunctioning processor from an isolated processor.  For example, failures of
(A-D), (B-C), and (D-U) will make B, D, and (B-D) useless.  Such finer
resolution as distinguishing a malfunctioning processor from an isolated
processor cannot be obtained without adding some links to the system.
However, such additional information is useful primarily for the maintenance
action (whether to repair or a processor) rather than for fault tolerance and
run-time reconfiguration.